Abstract

In the shortest superstring problem, we are given a set of strings $\{s_1, \ldots, s_k\}$ and
want to find a string that contains all $s_i$ as substrings and has minimum length. This
is a classical problem in approximation and the best known approximation factor is $2\frac{1}{2}$,
given by Sweedyk [19] in 1999. Since then no improvement has been made; however, two
other approaches yielding $2\frac{1}{2}$-approximation algorithms have been proposed by Kaplan
et al. [10] and recently by Paluch et al. [16], both based on a reduction to maximum
asymmetric TSP path (Max-ATSP-Path) and structural results of Breslauer et al. [5].

In this paper we give an algorithm that achieves an approximation ratio of $2\frac{11}{23}$,
breaking through the long-standing bound of $2\frac{1}{2}$. We use the standard reduction of
Shortest-Superstring to Max-ATSP-Path. The new, somewhat surprising, algorithmic
idea is to take the better of the two solutions obtained by using: (a) the currently
best $\frac{2}{3}$-approximation algorithm for Max-ATSP-Path and (b) a naive cycle-cover based
$\frac{1}{2}$-approximation algorithm. To prove that this indeed results in an improvement, we
further develop a theory of string overlaps, extending the results of Breslauer et al. [5]. This
theory is based on the novel use of Lyndon words, as a substitute for generic unbordered
rotations and critical factorizations, as used by Breslauer et al.

1 Introduction 

The Shortest Superstring Problem In the Shortest-Superstring problem we are
given a set of strings $\{s_1, \ldots, s_k\}$ and want to find a string that contains all $s_i$ as substrings and
has minimum length. The problem has several applications including data compression [8, 18]
and DNA sequencing [13, 14, 17, 24]. In the latter, one attempts to reconstruct a DNA
molecule, which is a string over the alphabet {A,C,G,T}, based on a massive set of short 
fragments. These fragments (i.e. substrings) of the molecule can be obtained by sequencing. 
The reconstruction problem can be viewed as a shortest superstring problem based on the 
premise that the original molecule is a superstring of all the fragments, and that shorter 
superstrings should in general be more similar to the original. 



Previous Results Since Shortest-Superstring is NP-hard [9] and even MAX-SNP-hard
[4, 23], the best we can hope for in terms of approximation is a constant factor. A lot
of effort went into designing approximation algorithms for the problem; Table 1 summarizes
these developments. Note that the last two results, by Kaplan et al. [10] and Paluch et al. [16],

'This work was partially supported by the ERC StG project PAA1 no. 259515 
Authors                                        Date   Factor

Li [13]                                        1990   O(log n)
Blum, Jiang, Li, Tromp, Yannakakis [4]         1991   3
Teng, Yao [21]                                 1993   2 8/9
Czumaj, Gasieniec, Piotrow, Rytter [7]         1994   2 5/6
Kosaraju, Park, Stein [12]                     1994   2 50/63
Armen, Stein [1]                               1995   2 3/4
Armen, Stein [2]                               1996   2 2/3
Breslauer, Jiang, Jiang [5]                    1997   2 25/42
Sweedyk [19]                                   1999   2 1/2
Kaplan, Lewenstein, Shafrir, Sviridenko [10]   2005   2 1/2
Paluch, Elbassioni, van Zuylen [16]            2012   2 1/2

Table 1: Previous results for the Shortest-Superstring problem.



do not improve the approximation factor. They both give $\frac{2}{3}$-approximation algorithms for the
related Max-ATSP-Path problem. Using a black-box reduction due to Breslauer et al. [5],
these give $2\frac{1}{2}$-approximation algorithms for Shortest-Superstring. Both, especially the
one due to Paluch et al., are significantly simpler than the original result of Sweedyk.

Parallel to these developments, some progress has been made towards resolving the Greedy
Superstring Conjecture (see [18, 20, 22]), which says that the greedy approach of repeatedly
picking the two strings that overlap the most and gluing them together until only a single
string remains is actually a 2-approximation. Blum et al. [4] showed that the greedy algorithm
gives a 4-approximation, and Kaplan et al. [11] improved this to $3\frac{1}{2}$.
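
The greedy approach mentioned above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from any of the cited papers: the names `overlap_len` and `greedy_superstring` are hypothetical, ties are broken by scan order, and the input is assumed to contain at least one string with no string being a substring of another.

```python
def overlap_len(u: str, v: str) -> int:
    """Length of the longest suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly glue together the two strings with the largest overlap."""
    pool = list(strings)
    while len(pool) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left string, index of right string)
        for i, u in enumerate(pool):
            for j, v in enumerate(pool):
                if i != j:
                    k = overlap_len(u, v)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = pool[i] + pool[j][k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (i, j)] + [merged]
    return pool[0]
```

For example, on the input `["abc", "bcd", "cde"]` the sketch first glues `abc` and `bcd` and then appends `cde`.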

Our Results/Techniques In this paper we develop several results that describe the struc- 
ture of the overlaps of a collection of strings. Our results can be viewed as an extension of 
the framework introduced by Breslauer et al. [5]. However, while Breslauer et al. use generic 
unbordered rotations and critical factorizations, we construct ours by using Lyndon words. It 
turns out that the added control we gain in this way allows for much more precise structural 
analysis of string overlaps. 

We use these results to obtain a $2\frac{11}{23}$-approximation for Shortest-Superstring, and
therefore break a long-standing bound of $2\frac{1}{2}$.

The basic idea of our approach is the following. For two strings u, v, let the overlap of 
u and v, denoted ov(u, v), be the longest suffix of u which is also a prefix of v. The overlap 
graph of a set of strings S is a complete directed graph on S with edge weights equal to 
lengths of corresponding overlaps. 
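
As a small illustration of these two definitions, the following Python sketch computes ov(u, v) and the edge weights of the overlap graph. The function names are hypothetical, and self-loops are left out of the graph purely to keep the example short (an assumption of this sketch, not a statement about the paper).

```python
def overlap(u: str, v: str) -> str:
    """ov(u, v): the longest suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u.endswith(v[:k]):
            return v[:k]
    return ""


def overlap_graph(strings):
    """Edge weights |ov(a, b)| for all ordered pairs of distinct strings."""
    # Self-loops (a, a) are omitted here; this is a simplification of the sketch.
    return {(a, b): len(overlap(a, b)) for a in strings for b in strings if a != b}
```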

Blum et al. [4] show how approximating Shortest-Superstring for a set of strings S
can be reduced to approximating the problem of finding a longest path in the overlap graph 
of a certain auxiliary set of strings R(S), called representative strings. The performance of 
the resulting algorithm depends on how well we can bound the overlap loss in the longest 
path approximation. 

This bound can essentially be improved in two ways: by using a better approximation
algorithm for the longest path problem in directed graphs (Max-ATSP-Path), or by providing
a better bound on the overlap of the optimum path. For the first direction, Kaplan et
al. [10] and Paluch et al. [16] both give a $\frac{2}{3}$-approximation for Max-ATSP-Path, which is the
best known. For the second, the bounds given by Breslauer et al. [5] are essentially tight.

In this paper we propose a third way to improve by joining the two objectives. Note that
one can approximate Max-ATSP-Path by finding a maximum weight cycle cover, removing
the lightest edge on each cycle, and then joining the resulting paths with arbitrary edges. This
naive algorithm only gives a $\frac{1}{2}$-approximation, significantly weaker than $\frac{2}{3}$, the tight case being
balanced 2-cycles. We observe however that, with a careful choice of representative strings
R(S), if the bounds given by Breslauer et al. are nearly tight, the cycles in the maximum
weight cycle cover are far from balanced. So far, in fact, that choosing the better of the two
solutions, one given by a $\frac{2}{3}$-approximation algorithm and one given by our naive algorithm,
results in an approximation algorithm for Shortest-Superstring with ratio strictly smaller
than $2\frac{1}{2}$.

It is worth noting that, similarly to the approach of Breslauer et al., our algorithm is 
a black-box reduction from Shortest-Superstring to Max-ATSP-Path. Therefore, any 
improvements on the approximation factor for the latter will yield an improvement for the 
former. 

Organization of the Paper The paper is organized as follows. In Section 2 we recall some
facts regarding the properties of strings and their overlaps, as well as the standard approach
to shortest superstring approximation. In Section 3 we describe the new algorithm and analyse its
approximation factor. This analysis relies on Theorem 3.2, which is the main technical result
of this paper. The remaining part of the paper is devoted to proving this theorem.

In Section 4 we present some general bounds concerning overlaps of strings. We believe
they might be of independent interest. In Section 5 we use these bounds to prove the main
theorem. Since the proof is a rather long and detailed case analysis, to facilitate understanding
of the basic ideas of the paper, in Subsection 5.2 we give a simple proof of a weaker version
of the main theorem. This version still gives an approximation factor smaller than $2\frac{1}{2}$.

Finally, in Section 6 we show that Theorem 3.2 is essentially tight. We also briefly discuss
reasons why using our bounds to improve the analysis of the greedy algorithm might be
difficult.

2 Preliminaries 

In this section we recall some definitions, results and ideas concerning basic properties of
strings. For a more extensive exposition the reader should consult any of the standard textbooks
on combinatorics on words, e.g. the excellent monograph by Lothaire [15].

We also describe the standard framework for Shortest-Superstring approximation.
Our presentation mostly follows that of Breslauer et al. [5]. Note however, that instead of
generic critical factorizations we use nice rotations, introduced at the end of Subsection 2.1.
This requires almost no changes in the framework, except for the proof of Lemma 2.7, which
we provide.

2.1 Stringology 

Basic concepts For a string v, we will use v[i] to denote the i-th letter of v, and v[i, j] to
denote the substring of v consisting of letters i, ..., j. We will use vu to denote the concatenation
of v and u, and $v^k$ to denote the concatenation of k copies of v. We will also use $v^\infty$ to denote
the semi-infinite string vvv.... Any representation of w = uv as a concatenation of two (not
necessarily nonempty) strings is called a factorization of w. The factorization is nontrivial if
both u and v are nonempty.

For a string w of length n, any integer $1 \le p \le n$ is a period of w if w[i] = w[i + p] for all
$1 \le i \le n - p$. Note that w always has at least one period, namely its length. The smallest
period of w is called the minimum period of w or simply the period of w, and is denoted p(w).

A string w is primitive if there is no v such that $w = v^k$ with $k \ge 2$.
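
The two definitions above translate directly into code. The following Python sketch (hypothetical helper names, 0-based indexing instead of the 1-based indexing of the text) computes the minimum period and tests primitivity by brute force.

```python
def period(w: str) -> int:
    """Smallest p with w[i] == w[i + p] for all valid i (0-based indices)."""
    n = len(w)
    for p in range(1, n + 1):
        if all(w[i] == w[i + p] for i in range(n - p)):
            return p
    return n  # only reached for the empty string


def is_primitive(w: str) -> bool:
    """True if w is not of the form v^k for some string v and k >= 2."""
    n = len(w)
    return not any(n % d == 0 and w[:d] * (n // d) == w for d in range(1, n))
```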

A string z is called a rotation of w if there exists a factorization w = uv such that z = vu.
In that case we also say that z is a rotation starting at position |u| + 1, or that |u| + 1 is z's
starting position in w. It is easy to see that if z is a rotation of w, then z is primitive iff
w is. It is also a standard fact that if w is primitive, then any rotation corresponding to a
nontrivial factorization of w is different from w. More generally, for a primitive w and two
different factorizations $w = u_1 v_1$ and $w = u_2 v_2$, the rotations $v_1 u_1$ and $v_2 u_2$ are different. It
follows that for primitive w, every rotation of w has a unique starting position in w.

We say that two strings are equivalent if one is a rotation of the other. Otherwise they 
are non-equivalent. 

We will assume a fixed order on the alphabet. This order induces a standard lexicographical
order on the set of strings. We use $u \prec v$ to denote that u is lexicographically smaller
than v, and $u \preceq v$ to denote that u is smaller or equal to v.

Let w be a primitive string and consider the order induced by $\prec$ on all rotations of w.
Let $w_{\min}$ and $w_{\max}$ be the minimal and maximal rotations in this order. Also, denote by
$i_{\min}(w)$ and $i_{\max}(w)$ the starting positions of $w_{\min}$ and $w_{\max}$ in w. Moreover, let $p_{\min}(w)$ and
$p_{\max}(w)$ be strings such that $w_{\min} = p_{\min}(w) p_{\max}(w)$ and $w_{\max} = p_{\max}(w) p_{\min}(w)$.
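
To make the notation concrete, here is a brute-force Python sketch that computes all six objects for a primitive string. The function names are hypothetical, and Python's built-in string comparison is assumed to play the role of the fixed order on the alphabet.

```python
def rotations(w: str):
    """All rotations of w; the rotation at list index i starts at position i + 1."""
    return [w[i:] + w[:i] for i in range(len(w))]


def min_max_rotation_data(w: str):
    """Return (w_min, w_max, i_min, i_max, p_min, p_max) for a primitive string w."""
    rots = rotations(w)
    w_min, w_max = min(rots), max(rots)
    i_min = rots.index(w_min) + 1  # 1-based starting positions, as in the text
    i_max = rots.index(w_max) + 1
    # w_max is the rotation of w_min that starts right after p_min.
    j = rotations(w_min).index(w_max)
    p_min, p_max = w_min[:j], w_min[j:]
    return w_min, w_max, i_min, i_max, p_min, p_max
```

For w = abaab this gives $w_{\min}$ = aabab, $w_{\max}$ = babaa, $p_{\min}$ = aa and $p_{\max}$ = bab.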

Lemma 2.1. Let w, $|w| \ge 2$, be a primitive string. Then both $p_{\min}(w)$ and $p_{\max}(w)$ are
nonempty. In other words, $w_{\min} \ne w_{\max}$.

Proof. Since $|w| \ge 2$ and w is primitive, it contains at least 2 different letters. It follows that
$w_{\min}$ and $w_{\max}$ start with different letters, and so they are different strings. In particular
both $p_{\min}(w)$ and $p_{\max}(w)$ are nonempty. □

The following property of $w_{\min}$ and $w_{\max}$ is implicitly used by Crochemore et al. [6].

Lemma 2.2. Let w, $|w| \ge 2$, be a primitive string. Then $w_{\max}$ is the only rotation of w that
starts with $p_{\max}(w)$, and $w_{\min}$ is the only rotation that starts with $p_{\min}(w)$.

Proof. We will prove the claim for $p_{\max}(w)$; the other part of the proof is analogous. Suppose
that there is a rotation $z = p_{\max}(w) v$ of w, other than $w_{\max} = p_{\max}(w) p_{\min}(w)$, that starts
with $p_{\max}(w)$. We claim that $z \succ w_{\max}$, which is a contradiction. To see that, notice that they
both start with $p_{\max}(w)$, and $v \succeq p_{\min}(w)$ since $p_{\min}(w)$ is a prefix of the minimal rotation
of w. The claim follows, since $v \ne p_{\min}(w)$. □

Borders A nonempty string b is called a border of a string w if w = bu = vb for some 
nonempty u, v. A string is unbordered if it has no border. So, a string w is unbordered if 
w has no proper prefix that is also a suffix of w. The following is a standard fact (see e.g. 
Proposition 5.1.2 in Lothaire [15]).

Lemma 2.3. Every primitive string w has a rotation that is unbordered. In particular, $w_{\min}$
and $w_{\max}$ are unbordered.

Proof. Suppose that $w_{\max}$ has a border, i.e. there exists a proper prefix v of $w_{\max}$ which is
also its suffix. Let $|w| = |w_{\max}| = n$ and $|v| = k < n$. Since v is a suffix of $w_{\max}$, we know that
$u = (v w_{\max})[1, n]$ is a rotation of w. We claim that $u \succ w_{\max}$, which is a contradiction. To see
this, notice that $u[1, k] = v = w_{\max}[1, k]$ and also $u[k+1, n] = w_{\max}[1, n-k] \succeq w_{\max}[k+1, n]$,
since $w_{\max}$ is maximal. So $u \succeq w_{\max}$, but this cannot be an equality since rotations of a
primitive string are all different.

The same proof applies to $w_{\min}$, or one can simply notice that it is a maximal rotation in
the lexicographical order induced by the reversed order on the alphabet. □

Remark. A primitive string w such that $w = w_{\max}$ is called a Lyndon word¹ (w.r.t. the
particular order on the alphabet that is used to define the lexicographical order). Note that the
two rotations appearing in Lemma 2.3 are Lyndon words, and in fact this lemma is equivalent
to saying that Lyndon words are unbordered.

Remark. One of the key ingredients of the results of Breslauer et al. [5] is the notion of a critical
factorization and the Critical Factorization Theorem (see Cesari et al. fE^j). Although
we do not use them directly, a reader acquainted with Breslauer et al. [5] will realize that they
are nevertheless present in our work. In particular $p_{\min}(w) p_{\max}(w)$ and $p_{\max}(w) p_{\min}(w)$ are
critical factorizations (this fact was used by Crochemore et al. in their proof of the Critical
Factorization Theorem).
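
The statement of Lemma 2.3 is easy to test experimentally. The sketch below is an exhaustive sanity check over a small alphabet, not a proof; all names in it are hypothetical. It checks that the minimal and maximal rotations of every primitive string up to a given length have no border.

```python
from itertools import product


def rotations(w: str):
    return [w[i:] + w[:i] for i in range(len(w))]


def is_primitive(w: str) -> bool:
    n = len(w)
    return not any(n % d == 0 and w[:d] * (n // d) == w for d in range(1, n))


def has_border(w: str) -> bool:
    """True if some nonempty proper prefix of w is also a suffix of w."""
    return any(w[:k] == w[-k:] for k in range(1, len(w)))


def lemma_holds_up_to(max_len: int, alphabet: str = "ab") -> bool:
    """Check that w_min and w_max are unbordered for all primitive w up to max_len."""
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if is_primitive(w):
                rots = rotations(w)
                if has_border(min(rots)) or has_border(max(rots)):
                    return False
    return True
```

Note that an arbitrary rotation may well be bordered: abaab has the border ab, while its minimal rotation aabab has none.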

Nice rotations Let w, $|w| \ge 2$, be a primitive string. The nice rotation of w is defined
to be $w_{\max}$ if $|p_{\max}(w)| < |p_{\min}(w)|$; otherwise it is defined to be $w_{\min}$. Let
$\alpha(w) = \min(|p_{\max}(w)|, |p_{\min}(w)|)$. We will call a primitive string nice if it is its own nice rotation.
Note that if w is nice, then:

• $w = w_{\max}$ and $\alpha(w) = |p_{\max}(w)| \le |w|/2$, or

• $w = w_{\min}$ and $\alpha(w) = |p_{\min}(w)| \le |w|/2$.

In particular we always have $\alpha(w) \le |w|/2$.

For a nice string w, we call x a w-string if x is a prefix of $w^\infty$.
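
The definition of the nice rotation and of $\alpha(w)$ can be illustrated with the following Python sketch. The function name is hypothetical, the input is assumed to be primitive with at least two letters, and Python's string comparison stands in for the order on the alphabet.

```python
def nice_rotation(w: str):
    """Return (nice rotation of w, alpha(w)) for a primitive string w with |w| >= 2."""
    n = len(w)
    rots = [w[i:] + w[:i] for i in range(n)]
    w_min, w_max = min(rots), max(rots)
    # |p_min| is the offset at which w_max starts inside w_min = p_min p_max.
    len_p_min = [w_min[i:] + w_min[:i] for i in range(n)].index(w_max)
    len_p_max = n - len_p_min
    alpha = min(len_p_min, len_p_max)
    nice = w_max if len_p_max < len_p_min else w_min
    return nice, alpha
```

For instance, for w = aab we have $p_{\min}$ = aa and $p_{\max}$ = b, so the nice rotation is $w_{\max}$ = baa with $\alpha(w)$ = 1.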

2.2 Shortest Superstring Approximation 

Basic ideas In the remainder of this paper we assume w.l.o.g. that S contains at least two
strings and that no string in S is a substring of another string.

For two strings u, v define the overlap of u and v, denoted ov(u,v), as the longest suffix 
of u that is also a prefix of v. Also, define the prefix of u w.r.t. v, denoted pref(u, v), as the 
string x such that u = xov(u,v), i.e. prefix is the part of u that does not overlap v. 

The following two directed graphs are good models of how the strings in S overlap with 
each other. The overlap graph of S is a complete directed graph with S as the vertex set,
and edge $(s_i, s_j)$ having length $|ov(s_i, s_j)|$. The prefix graph (also called the distance graph)
is defined similarly, only edge $(s_i, s_j)$ now has length $|pref(s_i, s_j)|$.

¹ We use the term "word" here and not "string" as "Lyndon word" seems to be a well established phrase. In
general, the algorithmic community tends to use the term "string" and the combinatorial community uses the
term "word". We decided to follow this rule and use the term "string" with the single exception of "Lyndon
word".
Let $\langle s_{i_1}, s_{i_2}, \ldots, s_{i_n} \rangle$ be the string $pref(s_{i_1}, s_{i_2})\, pref(s_{i_2}, s_{i_3}) \cdots pref(s_{i_{n-1}}, s_{i_n})\, s_{i_n}$. Obviously,
it is the shortest string containing $s_{i_1}, s_{i_2}, \ldots, s_{i_n}$ in that order. Notice that the optimal
solution has the form $\langle s_{i_1}, s_{i_2}, \ldots, s_{i_n} \rangle$ for some ordering $s_{i_1}, s_{i_2}, \ldots, s_{i_n}$ of the strings in S.

The length of $\langle s_{i_1}, s_{i_2}, \ldots, s_{i_n} \rangle$ is equal to

$|pref(s_{i_1}, s_{i_2})| + |pref(s_{i_2}, s_{i_3})| + \ldots + |pref(s_{i_{n-1}}, s_{i_n})| + |pref(s_{i_n}, s_{i_1})| + |ov(s_{i_n}, s_{i_1})|$,

which is the length of the cycle $s_{i_1} \to s_{i_2} \to \ldots \to s_{i_n} \to s_{i_1}$ in the prefix graph of S increased by
$|ov(s_{i_n}, s_{i_1})|$. Thus, the length of the shortest TSP tour in the prefix graph of S lower-bounds
the length of the shortest superstring.
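
The construction of the string for a fixed ordering, and the length identity just stated, can be checked on a toy instance with the following Python sketch. The helper names `ov`, `pref` and `merge` are hypothetical, and the input strings are assumed to be distinct with no string being a substring of another.

```python
def ov(u: str, v: str) -> str:
    """Longest suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u.endswith(v[:k]):
            return v[:k]
    return ""


def pref(u: str, v: str) -> str:
    """The part of u that does not overlap v, i.e. u = pref(u, v) ov(u, v)."""
    return u[: len(u) - len(ov(u, v))]


def merge(order):
    """Shortest string containing the given strings in the given order."""
    return "".join(pref(a, b) for a, b in zip(order, order[1:])) + order[-1]
```

For the ordering abc, bcd, cde the merged string is abcde, and its length 5 equals the length of the corresponding cycle in the prefix graph (1 + 1 + 3) plus the overlap of the last string with the first (0).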

The above considerations suggest that reduction to asymmetric TSP might be useful
in approximating Shortest-Superstring. Unfortunately, the best known approximation
algorithm for asymmetric TSP has factor $O(\log n / \log\log n)$ (see [3]), so this approach is not very
useful.

Let us look again at a generic solution $\langle s_{i_1}, s_{i_2}, \ldots, s_{i_n} \rangle$ and this time express its length
in terms of the overlap graph:

$$|\langle s_{i_1}, s_{i_2}, \ldots, s_{i_n} \rangle| = \sum_{j=1}^{n} |s_{i_j}| - \sum_{j=1}^{n-1} |ov(s_{i_j}, s_{i_{j+1}})|.$$

The right term in the above expression (the total overlap of $\langle s_{i_1}, s_{i_2}, \ldots, s_{i_n} \rangle$) is the length
of the path $s_{i_1} \to s_{i_2} \to \ldots \to s_{i_n}$ in the overlap graph, so the longest TSP path in the overlap graph
corresponds to the optimal solution for Shortest-Superstring. Longest TSP path in 
a directed graph (called Max-ATSP-Path) can be approximated within constant factor. 
Notice however, that this does not lead to a constant factor approximation for Shortest- 
Superstring. The problem is that the total overlap of the optimal solution could be very 
large compared to its length. In that case even a very good approximation algorithm for total 
overlap might only give mediocre approximation for the length of the superstring. 

Two-step reduction to Max-ATSP-Path We can avoid the problems described in the
previous paragraph by using the following two-step approach introduced by Blum et al. [4]:

1. Find a minimum cycle cover $\mathcal{C}_{\min}$ in the distance graph.

2. For each cycle $C \in \mathcal{C}_{\min}$ construct a representative string R(C) containing all strings in
C as substrings, let $R = R(\mathcal{C}_{\min}) = \{R(C) : C \in \mathcal{C}_{\min}\}$.

3. Find a Shortest-Superstring solution for R by reducing to Max-ATSP-Path.

The idea here is that the first step groups strings with large overlaps together, so that the 
overlaps of the strings in R are relatively small, and then the last step actually gives good 
approximation. 

The following series of lemmas and definitions from Blum et al. [4] and Breslauer et al. [5]
gives an idea of why this approach works.

For any cycle $C = s_{i_1} \to s_{i_2} \to \ldots \to s_{i_k}$ in $\mathcal{C}_{\min}$ let $R(C) = \langle s_{i_1}, s_{i_2}, \ldots, s_{i_k}, s_{i_1} \rangle$. Note
two interesting features of this definition. First, depending on where we break the cycle we
can start R(C) with any of the strings $s_{i_1}, \ldots, s_{i_k}$. Second, R(C) is actually "too long" as it
unnecessarily contains two copies of $s_{i_1}$; this will turn out useful later on.

Let OPT(S) and OPT(R) be the lengths of optimal Shortest-Superstring solutions 
for S and R. 
Lemma 2.4 (Follows from Lemma 2.6 of [5], also implicit in [3]).

$OPT(R) \le 2\,OPT(S)$.

For a cycle $C = s_{i_1} \to s_{i_2} \to \ldots \to s_{i_k}$ in the prefix graph define

$s(C) = pref(s_{i_1}, s_{i_2})\, pref(s_{i_2}, s_{i_3}) \cdots pref(s_{i_k}, s_{i_1})$.

Then |s(C)| is the length of C and s(C) essentially reads the prefixes along the cycle. The
strings s(C) for $C \in \mathcal{C}_{\min}$ have very interesting properties.

Lemma 2.5 (Claims 3 and 5 in Blum et al. [4]). The strings s(C) are all primitive with
$|s(C)| \ge 2$ and are all non-equivalent.

Let w(C) be the nice rotation of s(C) for every $C \in \mathcal{C}_{\min}$. As we already mentioned, the
representative strings R(C) defined as earlier are unnecessarily long. This can be used to prove
the following.

Lemma 2.6 (Special case of Lemma 5.1 in Breslauer et al. [5]). One can define the representative
R(C) for a cycle $s_{i_1} \to \ldots \to s_{i_k}$ so that:

• R(C) is a substring of $\langle s_{i_j}, s_{i_{j+1}}, \ldots, s_{i_k}, s_{i_1}, \ldots, s_{i_j} \rangle$ for some j (in particular
$OPT(R) \le 2\,OPT(S)$ still holds),

• R(C) is a w(C)-string.

Finally, we need to show that the strings R(C) do not overlap too much. The lemma 
below is stated in a slightly more general fashion so that it can be used more easily later on. 

Lemma 2.7 (Implicit in the proof of Lemma 3.3 in Breslauer et al. [5]). Let $w_1$ and $w_2$
be non-equivalent nice strings and let $x_i$ be a $w_i$-string for i = 1, 2. Also let $\alpha_i = \alpha(w_i)$ and
$l_i = |w_i|$. Then $|ov(x_1, x_2)| < l_1 + \alpha_2$.

Proof. Assume for a contradiction that $|ov(x_1, x_2)| \ge l_1 + \alpha_2$. Consider the string
$z = ov(x_1, x_2)[l_1 + 1, l_1 + \alpha_2]$. We have $z = ov(x_1, x_2)[1, \alpha_2] = w_2[1, \alpha_2]$, so we need to have
$l_1 = k l_2$ for some k because of Lemma 2.2, which is impossible.

To see why, notice that if $l_1 = k l_2$ and $|ov(x_1, x_2)| \ge l_1$, then either $w_1$ and $w_2$ are
equivalent (if k = 1) or $w_1$ is nonprimitive (if k > 1). □

Theorem 2.8 (Breslauer et al. [5]). Given a c-approximation for Max-ATSP-Path, one can
approximate Shortest-Superstring with approximation factor of $3\frac{1}{2} - 1\frac{1}{2}c$.

Proof. Consider the string $s = \langle R(C_1), \ldots, R(C_k) \rangle$ that is the optimal solution for R. Let
$R_{ov}$ be the total overlap of this string. Then by applying Lemma 2.7 to every pair of
consecutive strings we get

$$R_{ov} \le \sum_{i=1}^{k-1} \left( |w(C_i)| + \alpha(w(C_{i+1})) \right) \le \frac{3}{2} \sum_{i=1}^{k} |w(C_i)| \le \frac{3}{2}\,OPT(S).$$

A c-approximation algorithm for Max-ATSP-Path can be used to obtain a solution with total
overlap of at least $c R_{ov}$. The length of the resulting Shortest-Superstring solution is therefore
at most

$$OPT(R) + \frac{3}{2}(1 - c)\,OPT(S) \le 2\,OPT(S) + \frac{3}{2}(1 - c)\,OPT(S) = \left(3\frac{1}{2} - 1\frac{1}{2}c\right) OPT(S).$$

□
Since $\frac{2}{3}$-approximation algorithms for Max-ATSP-Path are known, we obtain the following.

Corollary 2.9 (Kaplan et al. [10], also Paluch et al. [16]). There exists a $2\frac{1}{2}$-approximation
algorithm for Shortest-Superstring.

3 The Algorithm 

In this section we give the new approximation algorithm and bound its approximation factor. 

Description The algorithm we are going to analyse is very simple. It returns a solution $S_0$
which is the better of the following two solutions $S_1$, $S_2$:

• $S_1$ is obtained by reducing Shortest-Superstring to Max-ATSP-Path and using any
approximation algorithm for the latter (e.g. one due to Kaplan et al. [10] or Paluch et al. [16]).

• $S_2$ is also obtained by reducing to Max-ATSP-Path, but this time we get the final
solution by computing the maximum weight cycle cover in the overlap graph of R and
dropping the lightest edge from every cycle.
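
The cycle-cover step behind the second solution can be sketched on a toy instance as follows. This is only an illustration under simplifying assumptions that are not taken from the paper: the maximum weight cycle cover is found by brute force over permutations (so it only works for a handful of strings), self-loops are not allowed, there are at least two input strings, and no string is a substring of another. All function names are hypothetical.

```python
from itertools import permutations


def ov_len(u: str, v: str) -> int:
    """Length of the longest suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0


def max_weight_cycle_cover(strings):
    """Brute force: best permutation without fixed points (no self-loops)."""
    n = len(strings)
    best, best_perm = -1, None
    for perm in permutations(range(n)):
        if any(perm[i] == i for i in range(n)):
            continue
        weight = sum(ov_len(strings[i], strings[perm[i]]) for i in range(n))
        if weight > best:
            best, best_perm = weight, perm
    return best_perm


def cycles_of(perm):
    """Decompose a permutation into its cycles."""
    seen, cycles = set(), []
    for start in range(len(perm)):
        if start not in seen:
            cyc, v = [], start
            while v not in seen:
                seen.add(v)
                cyc.append(v)
                v = perm[v]
            cycles.append(cyc)
    return cycles


def naive_superstring(strings):
    """Drop the lightest edge of every cycle, then glue the paths in arbitrary order."""
    paths = []
    for cyc in cycles_of(max_weight_cycle_cover(strings)):
        m = len(cyc)
        # t indexes the lightest edge cyc[t] -> cyc[t + 1] of this cycle.
        t = min(range(m), key=lambda t: ov_len(strings[cyc[t]], strings[cyc[(t + 1) % m]]))
        paths.append(cyc[t + 1:] + cyc[:t + 1])
    order = [strings[i] for path in paths for i in path]
    result = order[0]
    for s in order[1:]:
        result += s[ov_len(result, s):]
    return result
```

On the input `["abc", "bcd", "cda"]` the cover is the single cycle abc → bcd → cda → abc; its lightest edge is the last one, so the output is the path abc, bcd, cda merged into one string.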

Analysis For any cycle C in the overlap graph let $O_C$ be the total overlap of C, i.e. the sum of
the weights of its edges. Let $M_C$ be the minimum weight of an edge of C. Also, let $L_C$ be the
sum of the periods of the strings in C, which is equal to the total length of the corresponding
cycles in $\mathcal{C}_{\min}$.

Let |R| be the total length of the representative strings in R and let $\mathcal{C}$ be the maximum
weight cycle cover in the overlap graph of R. Note that $\sum_{C \in \mathcal{C}} L_C = w(\mathcal{C}_{\min})$. Moreover, let
$O_{\mathcal{C}} = \sum_{C \in \mathcal{C}} O_C$ and $M_{\mathcal{C}} = \sum_{C \in \mathcal{C}} M_C$. Finally, let c be the best known approximation ratio
for Max-ATSP-Path.

Lemma 3.1. $|S_1| \le OPT(R) + (1 - c) O_{\mathcal{C}}$ and $|S_2| \le OPT(R) + M_{\mathcal{C}}$.

Proof. We have $|R| - O_{\mathcal{C}} \le OPT(R)$ and $|S_2| \le |R| - O_{\mathcal{C}} + M_{\mathcal{C}}$, which proves the second part.

For the first, note that since $O_{\mathcal{C}} \ge |R| - OPT(R)$, any algorithm that approximates
$|R| - OPT(R)$ with factor c gives a solution of length at most $|S_1| \le |R| - c(|R| - OPT(R)) =
OPT(R) + (1 - c)(|R| - OPT(R)) \le OPT(R) + (1 - c) O_{\mathcal{C}}$. □

The main technical ingredient of this paper is the following theorem (we show in Section [6] 
that it is essentially tight). 

Theorem 3.2 (Main Theorem (local version)). For any cycle C in the overlap graph, we
have

$2 M_C + 7 O_C \le 11 L_C$.

By summing over all cycles of $\mathcal{C}$ we obtain the following.

Theorem 3.3 (Main Theorem (global version)).

$2 M_{\mathcal{C}} + 7 O_{\mathcal{C}} \le 11\, w(\mathcal{C}_{\min}) \le 11\,OPT(S)$.

The next two sections are devoted to the proof of Theorem 3.2. For now let us see what
it implies.
Corollary 3.4. Shortest-Superstring can be approximated with factor $2 + \frac{11(1-c)}{9-2c}$. In
particular for $c = \frac{2}{3}$ we get a $2\frac{11}{23}$-approximation.

Proof. It follows from Lemma 3.1 and Lemma 2.4 that

$|S_0| \le 2\,OPT(S) + \min(M_{\mathcal{C}}, (1 - c) O_{\mathcal{C}})$.

We can bound the second term as follows:

$$\min(M_{\mathcal{C}}, (1 - c) O_{\mathcal{C}}) \le \frac{2(1-c)}{9-2c} M_{\mathcal{C}} + \frac{7}{9-2c}(1 - c) O_{\mathcal{C}} \le \frac{11(1-c)}{9-2c}\,OPT(S).$$

For $c = \frac{2}{3}$ we obtain $\frac{11(1-c)}{9-2c} = \frac{11}{23}$ and so $|S_0| \le 2\frac{11}{23}\,OPT(S)$. □
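
The arithmetic in Theorem 2.8 and Corollary 3.4 can be double-checked with exact rational numbers. The sketch below uses hypothetical function names; it evaluates both approximation factors at c = 2/3 and the two weights of the convex combination used to bound the minimum.

```python
from fractions import Fraction


def factor_theorem_2_8(c: Fraction) -> Fraction:
    """Approximation factor 3 1/2 - 1 1/2 c from Theorem 2.8."""
    return Fraction(7, 2) - Fraction(3, 2) * c


def factor_corollary_3_4(c: Fraction) -> Fraction:
    """Approximation factor 2 + 11(1 - c) / (9 - 2c) from Corollary 3.4."""
    return 2 + 11 * (1 - c) / (9 - 2 * c)


c = Fraction(2, 3)
# Weight put on M in the convex combination bounding min(M, (1 - c) O).
lam = 2 * (1 - c) / (9 - 2 * c)
# Resulting coefficient of O; together with lam it is in ratio 7 : 2, as in Theorem 3.3.
coeff_o = (1 - lam) * (1 - c)
```

With c = 2/3 the old factor evaluates to 5/2 and the new one to 57/23, i.e. $2\frac{11}{23}$; the coefficients of M and O are 2/23 and 7/23.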

4 The General Bounds 

In this section we present and prove the bounds on overlaps of strings. We consider a set
of non-equivalent nice strings $w_1, \ldots, w_k$, and for each $i = 1, \ldots, k$ a $w_i$-string $x_i$. We use $l_i$
to denote $|w_i|$, and $\alpha_i$ to denote $\alpha(w_i)$. Moreover, for each $i \ne j$, $i, j \in \{1, \ldots, k\}$ we define
$ov_{ij} = ov(x_i, x_j)$ and $o_{ij} = |ov_{ij}|$. Finally, let $w_{ij}$ be the rotation of $w_i$ that matches $ov(x_i, x_j)$
from the left. If there is more than one such rotation (which might happen if $o_{ij} < l_i$), choose
any such rotation.

By Lemma 2.7 we have $o_{12} < l_1 + \frac{1}{2} l_2$. The main theme of this section is characterizing
situations in which this inequality is in some way non-tight. The underlying idea in most
(but not all) of these results is the following: We show that if $o_{12}$ is actually close to its
upper bound, then the set of possible starting positions of the maximal/minimal rotation of
$w_1$ is strongly limited, which in turn leads to an upper bound on $\alpha_1$. This can be used to
upper-bound other overlaps using Lemma 2.7.

We start with another lemma from the work of Breslauer et al. [5]. 

Lemma 4.1 (Implicit in the proof of Lemma 3.3 of Breslauer et al. [5]). If $l_1 < l_2$ then
$o_{12} < l_2$. For general $l_1, l_2$ we have $o_{12} < k l_2$ whenever $l_1 < k l_2$.

Proof. Assume for a contradiction that $l_1 < k l_2$ and $o_{12} \ge k l_2$. Also, w.l.o.g. assume that k
is the smallest integer such that $l_1 < k l_2$.

Similarly as in the proof of Lemma 2.7, we cannot have $l_1 = k l_2$, and we also cannot have
$l_1 = (k - 1) l_2$ for the same reasons. Therefore $(k - 1) l_2 < l_1 < k l_2$.

Consider now the string $ov_{12}[l_1 + 1, k l_2] = ov_{12}[1, k l_2 - l_1]$. This string is a non-trivial
suffix of $w_2$, and also a prefix of $w_2$, a contradiction with $w_2$ being nice and Lemma 2.3. □

The next two lemmas demonstrate that $o_{12}$ getting close to $l_1 + \frac{1}{2} l_2$ implies an upper
bound on the value of $\alpha_1$. While Lemma 4.3 gives this bound explicitly, Lemma 4.2 describes
it in terms of constraints on the starting positions of maximal and minimal rotations of $w_1$.

Lemma 4.2. Let $l_1 \ge l_2$, $o_{12} \ge l_2$ and let $w_2$ be its maximal rotation, then:

• $i_{\max}(w_{12}) = 1$, or $i_{\max}(w_{12}) = l_2 \lfloor \frac{o_{12} - 1}{l_2} \rfloor + 1$, or $i_{\max}(w_{12}) > \max(l_2 \lfloor \frac{o_{12} - 1}{l_2} \rfloor + 1,\ o_{12} - \alpha_2 + 1)$.

• $i_{\min}(w_{12}) = \alpha_2 + 1$, or $i_{\min}(w_{12}) = l_2 \lfloor \frac{o_{12} - \alpha_2 - 1}{l_2} \rfloor + \alpha_2 + 1$, or $i_{\min}(w_{12}) > \max(l_2 \lfloor \frac{o_{12} - \alpha_2 - 1}{l_2} \rfloor + \alpha_2 + 1,\ o_{12} - (l_2 - \alpha_2) + 1)$.

Moreover, if $o_{12} \ge l_1$, then $i_{\max}(w_{12}) \ne 1$ and $i_{\min}(w_{12}) = \alpha_2 + 1$.

Remark. Several of the lemmas appearing in the remainder of this section assume that one 
of the strings involved is its maximal rotation. In all cases a symmetrical statement is true as 
well, in which the roles of minimal and maximal rotations of all strings involved are reversed. 
We omit the corresponding statements in all these lemmas. 

Proof (of Lemma 4.2). Let us start with claims concerning $i_{\max}(w_{12})$. Since we assume $o_{12} \ge l_2$,
we know that $w_{12}$ contains $p_{\max}(w_2)$. Therefore $(w_{12})_{\max}[1, \alpha_2] \succeq p_{\max}(w_2)$. We now
consider two cases, depending on whether or not $i_{\max}(w_{12}) \le o_{12} - \alpha_2 + 1$.

Case 1: If $i_{\max}(w_{12}) \le o_{12} - \alpha_2 + 1$, then $i_{\max}(w_{12}) + \alpha_2 - 1 \le o_{12}$ and consequently
$(w_{12})_{\max}[1, \alpha_2] \preceq p_{\max}(w_2)$. By the previous observation this is in fact an equality, and by
Lemma 2.2, $i_{\max}(w_{12}) = k l_2 + 1$ for some natural k, i.e. the maximal rotation of $w_{12}$ is aligned
with the starting position of some occurrence of $w_2$ in $ov_{12}$. The two positions that appear
in the statement of the lemma, 1 and $l_2 \lfloor \frac{o_{12} - 1}{l_2} \rfloor + 1$, are the starting positions of the
leftmost and the rightmost occurrences, respectively. We will show that $i_{\max}(w_{12})$ has to be
equal to one of them.

Suppose that this is not the case. This means that $i_{\max}(w_{12}) = k l_2 + 1$ and both
$(k - 1) l_2 + 1$ and $(k + 1) l_2 + 1$ are in $[1, \ldots, o_{12}]$. Note that by Lemma 4.1 we then also have
$(k + 1) l_2 \le l_1$, and in fact $(k + 1) l_2 < l_1$ since otherwise $w_1$ would not be primitive. Therefore
$(k + 1) l_2 + 1 \le l_1$. Consider the rotations $r_1, r_2, r_3$ of $w_{12}$ starting at positions $(k - 1) l_2 + 1$,
$k l_2 + 1$ and $(k + 1) l_2 + 1$, respectively. We have $r_1 = w_2 w_2 w$, $r_2 = w_2 w w_2$ and $r_3 = w w_2 w_2$ for
some string w. Since all rotations of a primitive string are different, we have $w w_2 \ne w_2 w$. If
$w w_2 \succ w_2 w$, then $r_3$ is the largest of the three rotations. If, on the other hand, $w w_2 \prec w_2 w$,
then $r_1$ is the largest one. Therefore $i_{\max}(w_{12}) \ne k l_2 + 1$, a contradiction.

Case 2: We are left with the case where $i_{\max}(w_{12}) > o_{12} - \alpha_2 + 1$, and we can also assume
that we do not have $i_{\max}(w_{12}) = l_2 \lfloor \frac{o_{12} - 1}{l_2} \rfloor + 1$ since then the lemma clearly holds.

We need to prove that $i_{\max}(w_{12}) > l_2 \lfloor \frac{o_{12} - 1}{l_2} \rfloor + 1$. If that was not the case, then
$(k - 1) l_2 + 1 < i_{\max}(w_{12}) \le k l_2 < o_{12}$ for $k = \lfloor \frac{o_{12} - 1}{l_2} \rfloor$. Then $w_{12}[i_{\max}(w_{12}), k l_2]$ is a non-trivial suffix of
$w_2$, and it is also a prefix of $w_2$ by maximality of the rotation of $w_{12}$ starting at $i_{\max}(w_{12})$.
But that is a contradiction with the fact that $w_2$ is unbordered by Lemma 2.3. This ends the
proof of the bounds for $i_{\max}(w_{12})$.

One final claim we need to show concerning $i_{\max}(w_{12})$ is that if $o_{12} \ge l_1$, then $i_{\max}(w_{12}) \ne 1$.
Since for $o_{12} \ge l_1$ we have $l_1 > l_2$, there are at least two positions of the form $k l_2 + 1$
within $w_{12}$. Consider rotations $r_1$ and $r_2$ of $w_{12}$ starting at two consecutive such positions
$k l_2 + 1$ and $(k + 1) l_2 + 1$. We will prove that $r_1 \prec r_2$, which implies our claim. We have
$r_1 = w_2 w w_2^k$ and $r_2 = w w_2^{k+1}$ for some w which is a prefix of $w_2^\infty$. In particular, we have
$r_1 = w v w_2^k$ for some v which is a rotation of $w_2$. Since $r_1 \ne r_2$ by Lemma 2.2 and $w_2$ is its
maximal rotation, we conclude that $r_1 \prec r_2$.

Let us now prove the claims concerning $i_{\min}(w_{12})$. Similarly to the case of $i_{\max}(w_{12})$ we
can argue that either $i_{\min}(w_{12}) > o_{12} - (l_2 - \alpha_2) + 1$ or we have $i_{\min}(w_{12}) = k l_2 + \alpha_2 + 1$ for
some k.

This time it will be more convenient to start with the case of $o_{12} \ge l_1$. Among rotations
starting at positions of the form $k l_2 + \alpha_2 + 1$, the one starting at $\alpha_2 + 1$ is minimal in this
case, and the proof is almost identical to the one we just presented for $i_{\max}(w_{12})$.

So we only need to exclude the case where $o_{12} \ge l_1$ and $i_{\min}(w_{12}) > o_{12} - (l_2 - \alpha_2) + 1$.
If that happened, then we would have $(w_{12})_{\min}[1, l_2 - \alpha_2] = w_{12}[i_{\min}(w_{12}), l_1]\, w$, where w
is a non-empty prefix of $w_2$. Call this string $r_1$ and let $r_2 = w_{12}[\alpha_2 + 1, l_2]$. We claim
that $r_2 \prec r_1$, which is a contradiction with minimality of $r_1$. To see that $r_2 \prec r_1$, note
that $w_{12}[\alpha_2 + 1, \alpha_2 + l_1 - i_{\min}(w_{12})] \preceq w_{12}[i_{\min}(w_{12}), l_1]$ by the definition of $\alpha_2$. Moreover
$w_{12}[l_2 - |w| + 1, l_2] \preceq w$ because w is a prefix of $(w_2)_{\max}$. But we cannot have an equality
here, since then w would also be a suffix of $(w_2)_{\max}$, a contradiction with Lemma 2.3.

Let us now prove the main claims concerning $i_{\min}(w_{12})$. Again, we consider two cases.

Case 1: If $i_{\min}(w_{12}) \le o_{12} - (l_2 - \alpha_2) + 1$ and consequently $i_{\min}(w_{12})$ is of the form $k l_2 + \alpha_2 + 1$,
then we can show that we have either $i_{\min}(w_{12}) = \alpha_2 + 1$ or $i_{\min}(w_{12}) = l_2 \lfloor \frac{o_{12} - \alpha_2 - 1}{l_2} \rfloor + \alpha_2 + 1$.
The proof is almost identical to the one we provided for $i_{\max}(w_{12})$. The only step that does
not directly translate is that the rightmost position of the form $k l_2 + \alpha_2 + 1$ within $ov_{12}$ is
also within $w_{12}$. Luckily, we already considered the case of $o_{12} \ge l_1$.

Case 2: If $i_{\min}(w_{12}) > o_{12} - (l_2 - \alpha_2) + 1$, then we can also assume that we do not have
$i_{\min}(w_{12}) = l_2 \lfloor \frac{o_{12} - \alpha_2 - 1}{l_2} \rfloor + \alpha_2 + 1$. We need to show that $i_{\min}(w_{12}) > l_2 \lfloor \frac{o_{12} - \alpha_2 - 1}{l_2} \rfloor + \alpha_2 + 1$.
Again, the proof is almost identical to the one for $i_{\max}(w_{12})$. □

Lemma 4.3. Let $o_{12} \ge l_1$ and let $w_2$ be its maximal rotation. Then we always have
$\alpha_1 \le l_2 + (l_1 + \alpha_2 - o_{12})$ and moreover:

1. either $\alpha_1 \le l_1 + l_2 - o_{12}$, or

2. the maximal rotation of $w_{12}$ starts at the rightmost position of the form $kl_2 + 1$, or the
minimal rotation of $w_{12}$ starts at the rightmost position of the form $kl_2 + \alpha_2 + 1$.

Proof. Since we assume $o_{12} \ge l_1$, Lemma 4.2 describes all possibilities for $i_{\min}(w_{12})$ and
$i_{\max}(w_{12})$. The rest is simple case analysis.

If either $i_{\max}(w_{12}) = l_2 \left\lfloor \frac{o_{12} - 1}{l_2} \right\rfloor + 1$ or $i_{\min}(w_{12}) = l_2 \left\lfloor \frac{o_{12} - \alpha_2 - 1}{l_2} \right\rfloor + \alpha_2 + 1$ (i.e. the second
alternative in the statement of the lemma holds) then $i_{\min}(w_{12})$ is either at most $\alpha_2 + 1$ or at
least $o_{12} - l_2 + 1$ by Lemma 4.2 and the same bounds hold for $i_{\max}(w_{12})$. "Wrapping around"
the end of $w_{12}$, they both land in an interval of length $l_1 - (o_{12} - l_2) + \alpha_2 = l_2 + (l_1 + \alpha_2 - o_{12})$
and hence this quantity is also an upper bound on $\alpha_1$.

If neither $i_{\max}(w_{12}) = l_2 \left\lfloor \frac{o_{12} - 1}{l_2} \right\rfloor + 1$ nor $i_{\min}(w_{12}) = l_2 \left\lfloor \frac{o_{12} - \alpha_2 - 1}{l_2} \right\rfloor + \alpha_2 + 1$, then Lemma 4.2
gives even stronger bounds. We have $i_{\min}(w_{12}) \le \alpha_2 + 1$ or $i_{\min}(w_{12}) > o_{12} - (l_2 - \alpha_2) + 1$
and the same bounds (in fact stronger) hold for $i_{\max}(w_{12})$. Repeating the previous argument
we get

$$\alpha_1 \le l_1 - (o_{12} - (l_2 - \alpha_2)) + \alpha_2 = l_1 + l_2 - o_{12}.$$

□ 

The next two lemmas state some of the consequences of Lemmas 4.2 and 4.3 that are
particularly easy to use.

Lemma 4.4. If $l_1 \ge l_2$ then:

1. $o_{12} + \alpha_1 \le l_1 + l_2$ for $l_1 \le 2l_2$,

2. $o_{12} + \alpha_1 \le 2l_1 - l_2 = l_1 + l_2 + (l_1 - 2l_2)$ for $2l_2 \le l_1 \le \frac{5}{2} l_2$,

3. $o_{12} + \alpha_1 \le l_1 + l_2 + \alpha_2$.
Proof. In the proof, we assume w.l.o.g. that $w_2$ is its maximal rotation.

Let us first consider the case of $l_1 \le 2l_2$. If $o_{12} < l_1$ then we get the claim, since
$\alpha_1 \le \frac{1}{2} l_1 \le l_2$. On the other hand, if $o_{12} \ge l_1$ (note that in this case $l_1 > l_2$), then by
Lemma 4.2 we have $i_{\min}(w_{12}) = \alpha_2 + 1$ and $i_{\max}(w_{12}) \in \{l_2 + 1\} \cup \{o_{12} - \alpha_2 + 1, \ldots, l_1\}$.
Therefore, we either have $\alpha_1 \le l_2 - \alpha_2$ or $\alpha_1 \le (l_1 - o_{12} + \alpha_2) + \alpha_2$ and in both cases it is
easy to verify that our claim is true.

To prove the second inequality, we consider three cases:
Case 1: If $o_{12} < l_1$ then

$$o_{12} + \alpha_1 < l_1 + \alpha_1 \le \tfrac{3}{2} l_1 \le 2l_1 - l_2.$$

Case 2: If the first alternative in Lemma 4.3 holds, i.e. $\alpha_1 \le l_1 + l_2 - o_{12}$, then

$$o_{12} + \alpha_1 \le l_1 + l_2 \le 2l_1 - l_2.$$

Case 3: We are left with the case where the second alternative of Lemma 4.3 holds. Since
$o_{12} \ge l_1$ and so $i_{\min}(w_{12}) = \alpha_2 + 1$, this means that $i_{\max}(w_{12}) = 2l_2 + 1$. It follows that
$\alpha_1 \le (l_1 - 2l_2) + \alpha_2$ and so

$$o_{12} + \alpha_1 \le (l_1 + \alpha_2) + (l_1 - 2l_2) + \alpha_2 \le 2l_1 - l_2.$$

The third inequality of Lemma 4.4 follows immediately from the inequality $\alpha_1 \le l_2 + (l_1 +
\alpha_2 - o_{12})$ in the first part of Lemma 4.3. □

Let $\Delta o_{ij} = (l_i + \frac{1}{2} l_j) - o_{ij}$ and $\Delta\alpha_i = \frac{1}{2} l_i - \alpha_i$. These basically measure how much smaller
$o_{ij}$ and $\alpha_i$ are than their maximum values.

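Corollary 4.5.

1. $\Delta o_{12} + \Delta\alpha_1 \ge \frac{1}{2}(l_1 - l_2)$ if $l_1 \le 2l_2$,

2. $\Delta o_{12} + \Delta\alpha_1 \ge \frac{1}{4}(l_1 - l_2)$ if $l_1 \ge 3l_2$, and

3. $\Delta o_{12} + \Delta\alpha_1 \ge \frac{1}{6}(l_1 - l_2)$.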

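Proof. For the first inequality, we have by Lemma 4.4

$$\Delta o_{12} + \Delta\alpha_1 = \tfrac{3}{2} l_1 + \tfrac{1}{2} l_2 - o_{12} - \alpha_1 \ge \tfrac{3}{2} l_1 + \tfrac{1}{2} l_2 - l_1 - l_2 = \tfrac{1}{2}(l_1 - l_2).$$

For the second inequality, we have by Lemma 4.4

$$\Delta o_{12} + \Delta\alpha_1 \ge \tfrac{3}{2} l_1 + \tfrac{1}{2} l_2 - (l_1 + l_2 + \alpha_2) = \tfrac{1}{2} l_1 - \tfrac{1}{2} l_2 - \alpha_2 \ge \tfrac{1}{2} l_1 - l_2 \ge \tfrac{1}{4} l_1 + \tfrac{3}{4} l_2 - l_2 = \tfrac{1}{4}(l_1 - l_2).$$

Clearly, we only need to prove the third inequality for $2l_2 < l_1 < 3l_2$. We consider two
cases:

Case 1: If $2l_2 < l_1 \le \frac{5}{2} l_2$ then

$$\Delta o_{12} + \Delta\alpha_1 \ge \tfrac{3}{2} l_1 + \tfrac{1}{2} l_2 - (2l_1 - l_2) = \tfrac{3}{2} l_2 - \tfrac{1}{2} l_1 = \tfrac{1}{6}(l_1 - l_2) + \tfrac{5}{3} l_2 - \tfrac{2}{3} l_1 \ge \tfrac{1}{6}(l_1 - l_2).$$

Case 2: If $\frac{5}{2} l_2 \le l_1 < 3l_2$ then we have

$$\Delta o_{12} + \Delta\alpha_1 \ge \tfrac{3}{2} l_1 + \tfrac{1}{2} l_2 - (l_1 + l_2 + \alpha_2) \ge \tfrac{1}{2} l_1 - l_2 \ge \tfrac{1}{6} l_1 + \tfrac{1}{3} \cdot \tfrac{5}{2} l_2 - l_2 = \tfrac{1}{6}(l_1 - l_2).$$

□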
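The case analysis above can be sanity-checked numerically. The following is a minimal sketch, not part of the proof: it assumes the three upper bounds on $o_{12} + \alpha_1$ from Lemma 4.4 with the ranges $l_1 \le 2l_2$ and $2l_2 \le l_1 \le \frac{5}{2} l_2$, and the fact $\alpha_2 \le \frac{1}{2} l_2$ used in the proof. The function names are hypothetical.

```python
from fractions import Fraction as F

def slack_lower_bound(l1, l2, alpha2):
    """Lower bound on Δo12 + Δα1 = 3/2*l1 + 1/2*l2 - (o12 + α1), obtained from
    the upper bounds on o12 + α1 in Lemma 4.4 (ranges as assumed in this sketch)."""
    uppers = [l1 + l2 + alpha2]                # item 3: always available
    if l1 <= 2 * l2:
        uppers.append(l1 + l2)                 # item 1
    if 2 * l2 <= l1 and 2 * l1 <= 5 * l2:
        uppers.append(2 * l1 - l2)             # item 2, for 2*l2 <= l1 <= 5/2*l2
    return F(3, 2) * l1 + F(1, 2) * l2 - min(uppers)

def corollary_holds_on_grid(max_l2=12):
    """Check the three inequalities of Corollary 4.5 on a small integer grid."""
    for l2 in range(1, max_l2 + 1):
        for l1 in range(l2, 5 * l2 + 1):
            for alpha2 in range(0, l2 // 2 + 1):   # assumes α2 <= l2/2
                bound = slack_lower_bound(l1, l2, alpha2)
                if bound < F(1, 6) * (l1 - l2):
                    return False
                if l1 <= 2 * l2 and bound < F(1, 2) * (l1 - l2):
                    return False
                if l1 >= 3 * l2 and bound < F(1, 4) * (l1 - l2):
                    return False
    return True
```

For instance, `slack_lower_bound(5, 2, 1)` evaluates to exactly $\frac{1}{6}(l_1 - l_2) = \frac{1}{2}$, which suggests the constant $\frac{1}{6}$ is attained at $l_1 = \frac{5}{2} l_2$ under these assumptions.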
The last lemma is not used in the proof of the main theorem. Nevertheless, we decided
to include it in this section, as we believe it might prove useful in future developments.

Lemma 4.6. Let $l_1 < l_2$ and let $w_1$ be its maximal rotation. If $o_{12} > l_1 + \alpha_2 - \alpha_1$, then
$\alpha_1 \le |\alpha_2 - kl_1|$ for all positive integers $k$.

Remark. The most important consequence of the above lemma is that if $l_2 \approx 2l_1$, then we
cannot have $\alpha_1 \approx \frac{1}{2} l_1$, $\alpha_2 \approx \frac{1}{2} l_2$ and $o_{12} \approx l_1 + \alpha_2$ all happening at the same time.

Proof. We have O12 > h ~ «1 > bmax(wi2)|, SO U> 2 [1, \Pmax(wi2)\] = {W2) K 2 )|] 

is a substring of ovi 2 , and in particular p m ax(wi2) h (u> 2 )max[l, |Pmax(^l2)|]- Therefore, if 
Pmax(wi2) is contained inovi 2 we need to have p max (wi 2 ) = 1012 [1, \Pmax{wu)\] and i max (wi 2 ) = 
1 by Lemma 12.21 If, on the other hand, p ma x(u ; i2) is not contained in OV12, then we have 
WcOi 2 ) > 012 - (h -ai) + 1 > a 2 + 1 and so i max (u>i 2 ) € [a 2 + 1, . . . ,Zi]. 

Similarly we can see that ovi 2 contains u; 2 [a 2 + l,a 2 + |p m in(Vi2)|]i and so if p m m(wi 2 ) 
is contained in OV12, we have p m m{wi 2 ) = X2\ct2 + l,ce 2 + |Pmin(^i2)|]> and by Lemma l2~2l 
iminiwu) = (a 2 + 1) mod h. Otherwise, we have i m in(wi2) G [a 2 + 1, • • • ,h]. 

It is easy to verify that in all cases for $i_{\max}(w_{12})$ and $i_{\min}(w_{12})$ we get $\alpha_1 \le |\alpha_2 - kl_1|$ for
all positive integers $k$. □

5 The Proof of the Main Theorem 

In this section we present the proof of Theorem 3.2. We first introduce some additional
definitions and technical lemmas, designed specifically for this proof, in Subsection 5.1. Since
the proof itself is a rather long and detailed case analysis, in Subsection 5.2 we present a simple
proof of a weaker version of Theorem 3.2. This weaker statement still gives an approximation
ratio below $2\frac{1}{2}$. The proof of Theorem 3.2 follows, split for easier reading into a subsection
covering some basic observations and four subsections corresponding to different cycle lengths.

5.1 Preliminaries 

We keep the notation from previous sections. In particular, for a cycle $C = x_1 \to x_2 \to
\ldots \to x_k \to x_1$, we are interested in bounding $M = M_C = \min\{o_{ij} : (x_i, x_j) \in C\}$ and
$O = O_C = \sum_{(x_i, x_j) \in C} o_{ij}$ in terms of $L = L_C = \sum_{i=1}^{k} l_i$. Recall also that $\Delta o_{ij} = (l_i + \frac{1}{2} l_j) - o_{ij}$
and $\Delta\alpha_i = \frac{1}{2} l_i - \alpha_i$. Let $\Delta O = \sum_{(x_i, x_j) \in C} \Delta o_{ij} = \frac{3}{2} L - O$.
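As a small illustration of this notation, the sketch below computes $M$, $O$, $L$ and $\Delta O$ for a cycle given only its period lengths and edge overlaps. The function name and the list-based input convention are assumptions made for this example; the overlaps are taken as given rather than computed from strings.

```python
from fractions import Fraction

def cycle_stats(periods, overlaps):
    """periods[i] = l_i; overlaps[i] = overlap on the edge x_i -> x_{i+1} (cyclically).

    Returns (M, O, L, ΔO) as defined in the text."""
    k = len(periods)
    M = min(overlaps)
    O = sum(overlaps)
    L = sum(periods)
    # ΔO as the sum of Δo_ij = (l_i + l_j / 2) - o_ij over the edges of the cycle
    delta_O = sum(
        periods[i] + Fraction(periods[(i + 1) % k], 2) - overlaps[i]
        for i in range(k)
    )
    # Each l_i is counted once with weight 1 and once with weight 1/2
    assert delta_O == Fraction(3, 2) * L - O
    return M, O, L, delta_O
```

With the values of Example 6.1 for $k = 1$ (that is, $l_1 = 8$, $l_2 = 5$, $o_{12} = 9$, $o_{21} = 7$), `cycle_stats([8, 5], [9, 7])` gives $M = 7$, $O = 16$, $L = 13$ and $\Delta O = \frac{7}{2}$.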

We now introduce a couple more definitions. We call an edge $x_i \to x_j$ a down-edge if
$l_i > l_j$ and we call it an up-edge otherwise. We denote the sets of down-edges and up-edges
of $C$ by $C_d$ and $C_u$ respectively. A down-edge $x_i \to x_j$ is steep if $l_i > 2l_j$, otherwise it is flat.
Similarly an up-edge $x_i \to x_j$ is steep if $l_i < \frac{1}{2} l_j$, and flat otherwise.
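The edge classification can be written down directly. This is a minimal sketch with a hypothetical function name; it assumes the strict inequalities as read here ($l_i > l_j$ for down-edges, $l_i > 2l_j$ and $l_i < \frac{1}{2} l_j$ for steepness), so boundary cases count as flat.

```python
def classify_edge(li, lj):
    """Classify the edge x_i -> x_j by the period lengths l_i and l_j.

    Returns a pair (direction, slope) with direction in {"down", "up"}
    and slope in {"steep", "flat"}."""
    if li > lj:
        # down-edge: steep when l_i > 2 * l_j
        return ("down", "steep" if li > 2 * lj else "flat")
    # up-edge (includes l_i == l_j): steep when l_i < l_j / 2
    return ("up", "steep" if 2 * li < lj else "flat")
```

Comparing `2 * li < lj` instead of `li < lj / 2` keeps the test in integer arithmetic.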

Finally let $l_{\min}$ and $l_{\max} = l_1$ be the smallest and the largest among $l_1, \ldots, l_k$, breaking
ties arbitrarily.

Lemma 5.1. For any up-edge $x_i \to x_j$ we have

$$\Delta o_{ij} \ge l_i - \tfrac{1}{2} l_j \ge l_{\min} - \tfrac{1}{2} l_{\max}.$$

Proof. The second inequality is obvious.
As for the first, there is nothing to prove for steep up-edges since then $l_i - \frac{1}{2} l_j < 0$. For
flat up-edges we have

$$\Delta o_{ij} \ge (l_i + \tfrac{1}{2} l_j) - l_j = l_i - \tfrac{1}{2} l_j$$

by Lemma 4.1. □
Lemma 5.2. For any cycle $C$ we have

$$\Delta O \ge \tfrac{1}{12} \sum_{(x_i, x_j) \in C_d} (l_i - l_j) \ge \tfrac{1}{12}(l_{\max} - l_{\min}).$$

The constant can be improved to $\frac{1}{8}$ if there are no two consecutive steep down-edges in $C$, and
to $\frac{1}{4}$ if there are no steep down-edges in $C$.

Proof. Let $x_i \to x_j$ be a down-edge, and let $x_h \to x_i$ be the edge preceding it on $C$. Then we
get from Corollary 4.5

$$\Delta o_{hi} + \Delta o_{ij} \ge \Delta\alpha_i + \Delta o_{ij} \ge \tfrac{1}{6}(l_i - l_j). \qquad (1)$$

The left-hand side of the sum of inequality (1) over all down-edges is upper-bounded by
$2\Delta O$ and the claim follows.

If there are no steep down-edges in $C$, then this reasoning can be repeated using the
sharper bound in Corollary 4.5.

Finally, if there are no two consecutive steep down-edges in $C$, then let $C_s$ be the set of
steep down-edges and consider the sum of inequality (1) over $C_s$. Since steep down-edges are
nonconsecutive, the left-hand side of this sum is upper-bounded by $\Delta O$, and so

$$\Delta O \ge \tfrac{1}{6} \sum_{(x_i, x_j) \in C_s} (l_i - l_j).$$

We can also slightly improve the first part of the proof by using a stronger bound for flat
edges to obtain:

$$2\Delta O \ge \tfrac{1}{6} \sum_{(x_i, x_j) \in C_s} (l_i - l_j) + \tfrac{1}{2} \sum_{(x_i, x_j) \in C_d \setminus C_s} (l_i - l_j).$$

Adding twice the first inequality to the second one, we get

$$4\Delta O \ge \tfrac{1}{2} \sum_{(x_i, x_j) \in C_s} (l_i - l_j) + \tfrac{1}{2} \sum_{(x_i, x_j) \in C_d \setminus C_s} (l_i - l_j),$$

and the claim follows. □
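Lemma 5.3. If $l_{\min} \ge \frac{1}{2} l_{\max}$ then

$$\Delta O \ge \tfrac{1}{4} l_{\min}.$$

Proof. Note that in the proof of Lemma 5.2 we actually have

$$2\Delta O \ge \tfrac{1}{6} \sum_{(x_i, x_j) \in C_s} (l_i - l_j) + \tfrac{1}{2} \sum_{(x_i, x_j) \in C_d \setminus C_s} (l_i - l_j) + \sum_{(x_i, x_j) \in C_u} \Delta o_{ij}.$$

If $l_{\min} \ge \frac{1}{2} l_{\max}$, then since all edges are flat and there is at least one down-edge (we excluded
the case of all $l_i$ equal) we obtain

$$2\Delta O \ge \tfrac{1}{2} \sum_{(x_i, x_j) \in C_d} (l_i - l_j) + \left(l_{\min} - \tfrac{1}{2} l_{\max}\right) \ge \tfrac{1}{2}(l_{\max} - l_{\min}) + \left(l_{\min} - \tfrac{1}{2} l_{\max}\right) = \tfrac{1}{2} l_{\min}$$

by Lemma 5.1 and the claim follows. □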
5.2 Proof of a Weaker Version of the Main Theorem

In this subsection we present a weaker version of Theorem 3.2, which is relatively easy to
prove, and still leads to an approximation factor smaller than $2\frac{1}{2}$.

Theorem 5.4 (Main Theorem, weak local version). For any cycle $C$ in the overlap graph of
$R$ we have

$$M_C + 24 O_C \le 36\tfrac{1}{4} L_C.$$

Remark. Note that the emphasis here is on simplicity, and the proof below can easily be 
improved in many ways. 

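Proof. We first prove that we always have $\Delta O \ge \frac{1}{24k} L$, where $k$ is the length of the cycle.
We consider two cases:

Case 1: If $l_{\max} \le 2l_{\min}$ then by Lemma 5.3 we have $\Delta O \ge \frac{1}{4} l_{\min}$ and so, since
$L \le k\, l_{\max} \le 2k\, l_{\min}$,

$$\Delta O \ge \tfrac{1}{4} l_{\min} \ge \tfrac{1}{8k} L \ge \tfrac{1}{24k} L.$$

Case 2: If $l_{\max} > 2l_{\min}$ then we have by Lemma 5.2

$$\Delta O \ge \tfrac{1}{12}(l_{\max} - l_{\min}) = \tfrac{1}{12(2k-1)}\left((2k-1) l_{\max} - (2k-1) l_{\min}\right) \ge \tfrac{1}{12(2k-1)}\left((k-1) l_{\max} + l_{\min}\right) \ge \tfrac{1}{12(2k-1)} L \ge \tfrac{1}{24k} L.$$

From the inequality we have just proved we get

$$O \le \left(\tfrac{3}{2} - \tfrac{1}{24k}\right) L.$$

We also always have

$$M \le \tfrac{1}{k} O \le \tfrac{3}{2k} L.$$

Joining these gives

$$M + 24 O \le \left(\tfrac{3}{2k} + 36 - \tfrac{1}{k}\right) L = \left(36 + \tfrac{1}{2k}\right) L \le 36\tfrac{1}{4} L.$$

□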
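The final arithmetic of this proof can be replayed with exact rationals. The sketch below only checks the combination of the two bounds $O \le (\frac{3}{2} - \frac{1}{24k}) L$ and $M \le \frac{3}{2k} L$ derived above; the function name is hypothetical.

```python
from fractions import Fraction as F

def weak_coefficient(k):
    """Coefficient c(k) such that M + 24*O <= c(k) * L for a k-cycle,
    obtained by combining the two bounds from the proof."""
    o_coeff = F(3, 2) - F(1, 24 * k)   # O <= (3/2 - 1/(24k)) * L
    m_coeff = F(3, 2 * k)              # M <= O/k <= 3/(2k) * L
    return m_coeff + 24 * o_coeff

# The coefficient equals 36 + 1/(2k); it is largest for the shortest cycles (k = 2).
worst = max(weak_coefficient(k) for k in range(2, 100))
```

Here `worst` equals $\frac{145}{4} = 36\frac{1}{4}$, attained at $k = 2$.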
Corollary 5.5. There exists a $2\frac{145}{292}$-approximation algorithm for Shortest-Superstring.
Proof. Similarly as in the proof of Corollary 3.4 we get from Lemma 3.1 and Lemma 2.4 that

| So I < 20PT(S) + min (m, (l-^jO 

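We can bound the second term as follows:

$$\min\left(M, \tfrac{1}{3} O\right) \le \tfrac{1}{73} M + \tfrac{72}{73} \cdot \tfrac{1}{3} O = \tfrac{1}{73}(M + 24 O) \le \tfrac{36\frac{1}{4}}{73} L = \tfrac{145}{292} L \le \tfrac{145}{292}\, OPT(S),$$

and so the resulting superstring has length at most $2\frac{145}{292}\, OPT(S)$. □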
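The constant $\frac{145}{292}$ can be reproduced mechanically. The sketch below assumes only the weak bound $M + 24O \le 36\frac{1}{4} L$ of Theorem 5.4 and the convex combination with weight $\frac{1}{73}$ used in the proof; the function name is hypothetical.

```python
from fractions import Fraction as F

def superstring_ratio(weak_constant=F(145, 4), lam=F(1, 73)):
    """Approximation ratio 2 + c, where min(M, O/3) <= c * L.

    Uses min(M, O/3) <= lam*M + (1 - lam)*O/3, which for lam = 1/73
    equals (M + 24*O)/73 and is then bounded via Theorem 5.4."""
    assert (1 - lam) * F(1, 3) == 24 * lam   # the combination is a multiple of M + 24*O
    return 2 + weak_constant * lam
```

Calling `superstring_ratio()` returns $2 + \frac{145}{292}$, which is below $2\frac{1}{2}$ by $\frac{1}{292}$.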

5.3 The Proof of Theorem 3.2: Basic Observations

Let us recall Theorem 3.2.

Theorem 3.2 (restated). For every cycle $C$ in the overlap graph of $R$, we have

$$2M_C + 7O_C \le 11 L_C.$$

We can easily get rid of the following special case, which will make some of the reasoning easier
later on.

Lemma 5.6. If all $l_i$ are equal for a cycle $C$ then the claim of Theorem 3.2 holds.

Proof. Since two non-equivalent strings of equal length $l$ cannot have an overlap of length $l$
or greater, it follows that in this case $\gamma < 1$. Therefore $\beta \le \frac{1}{2}\gamma < \frac{1}{2}$ and $2\beta + 7\gamma < 8$, a much
stronger bound than needed. □

In the remainder of this section we assume that not all $l_i$ are equal.

Lemma 5.7. Either of the following statements implies the claim of Theorem 3.2 for a $k$-cycle
$C$:

• $2M - 7\Delta O \le \frac{1}{2} L$,

• $\Delta O \ge \frac{6-k}{2(7k+2)} L$.

Before proving the above lemma, let us note its particularly useful consequences:
Corollary 5.8. Let $C$ be a $k$-cycle. Then the claim of Theorem 3.2 for $C$

• is implied by AO > if k = 4, 

• is implied by AO > j^L if k = 5, 

• holds if k > 6. 

Proof (of Lemma 5.7). For the first part we have

$$2M + 7O = 2M + 7\left(\tfrac{3}{2} L - \Delta O\right) = 10\tfrac{1}{2} L + (2M - 7\Delta O),$$

and the claim follows.

For the second part, note that $M \le \frac{1}{k} O$. Therefore if we have $\Delta O \ge \frac{6-k}{2(7k+2)} L$, then

$$2M + 7O \le \frac{2 + 7k}{k}\, O \le \frac{2 + 7k}{k} \left(\frac{3}{2} L - \frac{6-k}{2(7k+2)} L\right) = \frac{2 + 7k}{k} \cdot \frac{22k}{2(7k+2)}\, L = 11L.$$
□ 
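The second part of this proof is a one-line identity in $k$, which the following sketch checks with exact rationals. It assumes the threshold $\frac{6-k}{2(7k+2)}$ that appears in the proof; the function names are hypothetical.

```python
from fractions import Fraction as F

def threshold(k):
    """Value t(k) such that ΔO >= t(k) * L suffices for a k-cycle."""
    return F(6 - k, 2 * (7 * k + 2))

def bound_at_threshold(k):
    """Coefficient of L in the bound on 2M + 7O when ΔO = t(k) * L.

    Uses M <= O/k, so 2M + 7O <= (2 + 7k)/k * O, and O = 3/2 * L - ΔO."""
    return F(2 + 7 * k, k) * (F(3, 2) - threshold(k))
```

Under this formula the threshold is $\frac{1}{30}$ for $k = 4$, $\frac{1}{74}$ for $k = 5$, and non-positive for $k \ge 6$, so any bound of the form $\Delta O \ge cL$ with $c$ at least these values is sufficient.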
5.4 The Proof of Theorem 3.2 for 5-cycles

The remainder of the proof is divided into four parts, one for each cycle length in {2, 3, 4, 5}.
Although there are similarities between these parts, they are mostly independent. For easier
reading, we put each part in a separate subsection.

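Lemma 5.9. If $C$ is a 5-cycle, then $2M + 7O \le 11L$.

Proof. We consider three cases. In all three we prove that $\Delta O \ge \frac{1}{72} L$ and the claim follows
from Corollary 5.8.

Case 1: If $l_{\min} \ge \frac{1}{2} l_{\max}$ we have by Lemma 5.3 that $\Delta O \ge \frac{1}{4} l_{\min}$ and so

$$\Delta O \ge \tfrac{1}{72}(18\, l_{\min}) \ge \tfrac{1}{72}(l_{\min} + 4 l_{\max}) \ge \tfrac{1}{72} L.$$

Case 2: If $\frac{1}{4} l_{\max} \le l_{\min} < \frac{1}{2} l_{\max}$, then we cannot have two consecutive steep down-edges,
and so by Lemma 5.2 we have

$$\Delta O \ge \tfrac{1}{8}(l_{\max} - l_{\min}) = \tfrac{1}{72}(9 l_{\max} - 9 l_{\min}) \ge \tfrac{1}{72}(4 l_{\max} + l_{\min}) \ge \tfrac{1}{72} L.$$

Case 3: Finally, if $l_{\min} < \frac{1}{4} l_{\max}$, then we have

$$\Delta O \ge \tfrac{1}{12}(l_{\max} - l_{\min}) = \tfrac{1}{72}(6 l_{\max} - 6 l_{\min}) \ge \tfrac{1}{72}(4 l_{\max} + 2 l_{\min}) \ge \tfrac{1}{72} L.$$

□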



5.5 The Proof of Theorem 3.2 for 4-cycles
Lemma 5.10. If $C$ is a 4-cycle, then $2M + 7O \le 11L$.
Proof. We again consider several cases.

Case 1: If $l_{\min} \ge \frac{1}{2} l_{\max}$ then by Lemma 5.3 we have $\Delta O \ge \frac{1}{4} l_{\min}$ and so

$$\Delta O \ge \tfrac{1}{28}(7 l_{\min}) \ge \tfrac{1}{28}(l_{\min} + 3 l_{\max}) \ge \tfrac{1}{28} L,$$

and the claim follows by Corollary 5.8.

Case 2: If $l_{\min} < \frac{1}{2} l_{\max}$, but all down-edges of $C$ are flat then we have by Lemma 5.2 that
$\Delta O \ge \frac{1}{4}(l_{\max} - l_{\min})$ and so

$$\Delta O \ge \tfrac{1}{28}(7 l_{\max} - 7 l_{\min}) \ge \tfrac{1}{28}(3 l_{\max} + l_{\min}) \ge \tfrac{1}{28} L.$$

Therefore we only need to consider cases where at least one down-edge of $C$ is steep. This
is implicitly assumed in all remaining cases.

Case 3: If $l_1 > l_2 < l_3 > l_4$, i.e. the edges of $C$ are alternating down-up-down-up, then

$$\Delta O = (\Delta o_{41} + \Delta o_{12}) + (\Delta o_{23} + \Delta o_{34}) \ge \tfrac{1}{6}(l_1 - l_2) + \tfrac{1}{6}(l_3 - l_4)$$

by Corollary 4.5. We also have

$$M \le \min(o_{41}, o_{23}) \le \min(l_1, l_3) \le \tfrac{1}{2}(l_1 + l_3).$$

Therefore

$$2M - 7\Delta O \le l_1 + l_3 + \tfrac{7}{6}(-l_1 + l_2 - l_3 + l_4) = \tfrac{1}{6}(-l_1 + 7 l_2 - l_3 + 7 l_4) \le \tfrac{1}{2}(l_1 + l_2 + l_3 + l_4)$$

since $l_2 < l_1$ and $l_4 < l_3$.

Case 4: If $l_1 > l_2 < l_3 < l_4$, then we have $\Delta O \ge \frac{1}{6}(l_1 - l_2)$ and $M \le o_{23} \le l_3$. Therefore

$$2M - 7\Delta O \le 2 l_3 - \tfrac{7}{6} l_1 + \tfrac{7}{6} l_2 \le \left(\tfrac{1}{2} l_4 + \tfrac{1}{2} l_3 + l_1\right) - \tfrac{7}{6} l_1 + \left(\tfrac{1}{2} l_2 + \tfrac{2}{3} l_1\right) = \tfrac{1}{2} L,$$

and the claim follows from Lemma 5.7.

Case 5: If $l_1 > l_2 > l_3 < l_4$, we consider two subcases. Since we excluded Cases 1 and 2, at
least one down-edge of $C$ is steep.

Case 5a: If l 2 < then AO > \(l x - l 2 ) and M < o 34 < l 3 + ±Z 4 . Therefore 

2M - 7AO < 2/3 + h + ^2 - ^1 < (7^3 + + (^4 + 7^1) + (^2 + ^1) - \h < \l 

Case 5b: If l 2 > \h and l 3 < ^l 2 , then AO > — Z3) by Lemma I5T21 and M < 034 < 
l 3 + 5Z4. Therefore 

2M-7AO < 2l 3 +h--h+-l 3 < y/ 3 +/4-g/l < (g/3+^2+^1) + (^4+^l) - 7^1 < \l. 

Case 6: We are left with the case where $l_1 > l_2 > l_3 > l_4$, i.e. $C$ has three down-edges, and
at least one of them is steep. We consider three subcases:
Case 6a: If l 2 < then similarly to Case 3 we have 

AO > ( A041 + A012) + ( Ao 23 + Ao 34 ) >-(h- h) + \{h- k 



6\ J 6 



We also have M < 034 < l 3 + Therefore 



2M - 7 AO < 2/3 + k + 77 ( - h + h ~ h + h ) = —xh + 77 h + xh + -rrh < 

b V / b b b 

7, /3, 2,\ /3, 1 \ /3, 5, \ 1 

Case 6b: If l 3 < \l 2 then AO > \{l 2 - l 3 ) and M < l 3 + ±Z 4 . Therefore 

7 7 7 19 7 22 1 

2M - 7AO < 2/3 + h- -h + 77/3 = -7^2 + -77-/3 + < -77/2 + -77-/3 + 77/4 < 

boob b b 2 

^ 7, /3, 13, 3,\ 1, ^ 1 r 
^-b^ + (b /l + ^ 2+ o /3 ) + 2^2 L - 

Case 6c: If I4 < 5Z3 then similarly to Case ba we have 

AO >i(z!-Z 2 )+ 7^3-/4), 

but this time we use the bound M < 041 < I4 + We get 

7 / \ 1 7 7 19 

2M -7AO<2U + h + -(-h + l 2 -l 3 + lA =~-h + -h ~ -h + tt/4 < 

b V / b b b b 

1, /3, 4, \ 7, /3, 8, \ 1 

< _ t + (_ i2 + _ _, 3 + (_ l4 + _, 3 ) < _ L . 



□ 
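5.6 The Proof of Theorem 3.2 for 3-cycles
Lemma 5.11. If $C$ is a 3-cycle then $2M + 7O \le 11L$.

Proof. There are essentially two kinds of 3-cycles: ones with (cyclically) increasing $l_i$, and
ones with decreasing $l_i$.

Case 1: If $l_1 > l_2 < l_3$ (i.e. $l_i$ are cyclically increasing), then we consider three subcases.
Case 1a: If $l_{\max} \le 2 l_{\min}$, i.e. $l_1 \le 2 l_2$, then

$$\Delta O \ge \Delta o_{23} + \Delta o_{31} \ge \left(l_2 - \tfrac{1}{2} l_3\right) + \left(l_3 - \tfrac{1}{2} l_1\right) = l_2 + \tfrac{1}{2} l_3 - \tfrac{1}{2} l_1 \ge \tfrac{1}{2} l_3$$

by Lemma 5.1. We also have $M \le o_{23} \le l_3$. Then

$$2M - 7\Delta O \le 2 l_3 - \tfrac{7}{2} l_3 < 0 < \tfrac{1}{2} L.$$

Case 1b: If $l_1 \ge 2 l_3$ then we have

$$\Delta O \ge \Delta o_{12} \ge \tfrac{1}{6}(l_1 - l_2)$$

and $M \le o_{23} \le l_2 + \frac{1}{2} l_3$. Therefore

$$2M - 7\Delta O \le 2 l_2 + l_3 + \tfrac{7}{6}(-l_1 + l_2) = -\tfrac{7}{6} l_1 + \tfrac{19}{6} l_2 + l_3 \le -\tfrac{7}{6} l_1 + \left(\tfrac{1}{2} l_2 + \tfrac{4}{3} l_1\right) + \left(\tfrac{1}{2} l_3 + \tfrac{1}{4} l_1\right) \le \tfrac{1}{2} L.$$

Case 1c: If $2 l_2 < l_1 < 2 l_3$ then

$$\Delta O \ge \max(\Delta o_{12}, \Delta o_{31}) \ge \max\left(\tfrac{1}{6}(l_1 - l_2),\; l_3 - \tfrac{1}{2} l_1\right)$$

and $M \le o_{23} \le l_2 + \frac{1}{2} l_3$. Hence

$$2M - 7\Delta O \le 2 l_2 + l_3 - \left(l_3 - \tfrac{1}{2} l_1\right) - 6 \cdot \tfrac{1}{6}(l_1 - l_2) = -\tfrac{1}{2} l_1 + 3 l_2 \le -\tfrac{1}{2} l_1 + \left(l_1 + \tfrac{1}{2} l_2 + \tfrac{1}{2} l_3\right) = \tfrac{1}{2} L.$$

Case 2: If $l_1 > l_2 > l_3$ then we consider several subcases. The logic in their ordering is that
we are trying to eliminate the easy ones first until only the hardest case remains, one that
is actually tight.

Case 2a: If $l_1 \le 2 l_3$, i.e. $l_{\max} \le 2 l_{\min}$, then by Lemma 5.3 we have $\Delta O \ge \frac{1}{4} l_3$. We also have
$M \le \min(o_{23}, o_{31}) \le \min(l_2 + \frac{1}{2} l_3, l_1)$. Therefore

$$2M - 7\Delta O \le \tfrac{1}{2}\left(l_2 + \tfrac{1}{2} l_3\right) + \tfrac{3}{2} l_1 - \tfrac{7}{4} l_3 = \tfrac{3}{2} l_1 + \tfrac{1}{2} l_2 - \tfrac{3}{2} l_3 \le \tfrac{1}{2} L.$$

Case 2b: If $l_1 > 2 l_3$, but both down-edges are flat, then by Lemma 5.2 we have $\Delta O \ge
\frac{1}{4}(l_1 - l_3)$. Using the same bound on $M$ as in the previous case, we obtain

$$2M - 7\Delta O \le \tfrac{1}{2}\left(l_2 + \tfrac{1}{2} l_3\right) + \tfrac{3}{2} l_1 - \tfrac{7}{4}(l_1 - l_3) = -\tfrac{1}{4} l_1 + \tfrac{1}{2} l_2 + 2 l_3 \le \tfrac{1}{2} L.$$

We are left with the case where at least one down-edge of $C$ is steep.
Case 2c: If $l_2 < \frac{1}{2} l_1$ then we have $\Delta O \ge \frac{1}{6}(l_1 - l_2)$ by Corollary 4.5. We also have
$M \le o_{23} \le l_2 + \frac{1}{2} l_3$. Hence

$$2M - 7\Delta O \le 2 l_2 + l_3 - \tfrac{7}{6} l_1 + \tfrac{7}{6} l_2 = -\tfrac{7}{6} l_1 + \tfrac{19}{6} l_2 + l_3 \le -\tfrac{7}{6} l_1 + \left(\tfrac{1}{2} l_2 + \tfrac{4}{3} l_1\right) + \left(\tfrac{1}{2} l_3 + \tfrac{1}{4} l_1\right) \le \tfrac{1}{2} L.$$

Case 2d: If $l_2 \ge \frac{1}{2} l_1$ and $2 l_3 < l_2 \le \frac{5}{2} l_3$ then we have

$$\Delta O \ge \Delta o_{12} + \Delta o_{23} \ge \Delta\alpha_2 + \Delta o_{23} \ge \left(\tfrac{3}{2} l_2 + \tfrac{1}{2} l_3\right) - (2 l_2 - l_3) = \tfrac{3}{2} l_3 - \tfrac{1}{2} l_2$$

by Lemma 4.4 and we also have

$$\Delta O \ge \Delta o_{12} \ge \tfrac{1}{2}(l_1 - l_2).$$

Joining these two bounds with $M \le o_{31} \le l_3 + \frac{1}{2} l_1$ gives

$$2M - 7\Delta O \le 2 l_3 + l_1 - 6\left(\tfrac{3}{2} l_3 - \tfrac{1}{2} l_2\right) - \tfrac{1}{2}(l_1 - l_2) = \tfrac{1}{2} l_1 + \tfrac{7}{2} l_2 - 7 l_3 \le \tfrac{1}{2} l_1 + \left(\tfrac{1}{2} l_2 + \tfrac{15}{2} l_3\right) - 7 l_3 = \tfrac{1}{2} L.$$

Case 2e: Finally if $l_2 \ge \frac{1}{2} l_1$ and $l_2 > \frac{5}{2} l_3$ then we have

$$\Delta O \ge \Delta o_{12} + \Delta o_{23} \ge \Delta\alpha_2 + \Delta o_{23} \ge \left(\tfrac{3}{2} l_2 + \tfrac{1}{2} l_3\right) - (l_2 + l_3 + \alpha_3) \ge \tfrac{1}{2} l_2 - l_3$$

by Lemma 4.4. We now proceed similarly to the previous case:

$$2M - 7\Delta O \le 2 l_3 + l_1 - 6\left(\tfrac{1}{2} l_2 - l_3\right) - \tfrac{1}{2}(l_1 - l_2) = \tfrac{1}{2} l_1 - \tfrac{5}{2} l_2 + 8 l_3 = \tfrac{1}{2} L - 3 l_2 + \tfrac{15}{2} l_3 \le \tfrac{1}{2} L.$$

□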

5.7 The Proof of Theorem 3.2 for 2-cycles

Before we proceed with the case of 2-cycles, we need an additional technical lemma.
Lemma 5.12. If $l_1 > 2 l_2$ then $\Delta o_{12} + \Delta o_{21} \ge \frac{1}{2} l_2$.

Proof. If $o_{12} < l_1$ then clearly $\Delta O \ge \Delta o_{12} > \frac{1}{2} l_2$. Hence we can assume $o_{12} \ge l_1$. Note that
this means that $l_1$ is not a multiple of $l_2$, since $w_1$ is primitive.

Assume w.l.o.g. that $w_2$ is its maximal rotation and let $k \ge 2$ be such that $k l_2 < l_1 <
(k+1) l_2$. Since $o_{12} \ge l_1$, by Lemma 4.2 we get $i_{\min}(w_{12}) = \alpha_2 + 1$ and $i_{\max}(w_{12}) \ge k l_2 + 1$.
This means that $|p_{\min}(w_1)| \ge k l_2 - \alpha_2 = (k-1) l_2 + (l_2 - \alpha_2) \ge \frac{3}{2} l_2$ and $|p_{\max}(w_1)| \le
l_1 - k l_2 + \alpha_2 < \frac{3}{2} l_2$. Therefore, $w_1$ is its maximal rotation as well.

Since $i_{\max}(w_{12}) \ge k l_2 + 1$ and $l_2$ does not divide $l_1$, we know $p_{\max}(w_1) = w\, p_{\max}(w_2)$,
where $w = w_{12}[i_{\max}(w_{12}), l_1]$. Note that $|w| < |w_2|$ and $w$ is a prefix of $w_2$ (because it is an
initial segment of a maximal rotation of $w_1$).

We will show that $o_{21} \le |p_{\max}(w_1)| = \alpha_1$, which implies the claim of the lemma. Assume
the opposite, i.e. $o_{21} > \alpha_1$. Then $ov_{21}[1, \alpha_1] = p_{\max}(w_1) = w\, p_{\max}(w_2)$. By Lemma 2.3 this can
only happen if $w_2$ is aligned with position $|w| + 1$ of $ov_{21}$. But then $w$ is a suffix of $w_2$, and
since it is also a prefix of $w_2$ we get a contradiction with Lemma 2.3. □
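Lemma 5.13. If $C$ is a 2-cycle, then $2M + 7O \le 11L$.
Proof. We consider three cases.

Case 1: If $l_1 \le 2 l_2$, i.e. the down-edge of $C$ is flat, then we have

$$\Delta O \ge \max\left(\tfrac{1}{2}(l_1 - l_2),\; l_2 - \tfrac{1}{2} l_1\right)$$

by Corollary 4.5 and Lemma 5.1. We also have $M \le o_{21} \le l_1$, and so

$$2M - 7\Delta O \le 2 l_1 - 5 \cdot \tfrac{1}{2}(l_1 - l_2) - 2\left(l_2 - \tfrac{1}{2} l_1\right) = \tfrac{1}{2} L.$$

Case 2: If $2 l_2 < l_1 < 3 l_2$ then we have $\Delta O \ge \frac{1}{2} l_2$ by Lemma 5.12. This easily gives our
main claim since using $M \le o_{21} \le l_2 + \frac{1}{2} l_1$ we have

$$2M - 7\Delta O \le 2 l_2 + l_1 - \tfrac{7}{2} l_2 = l_1 - \tfrac{3}{2} l_2 \le \left(\tfrac{1}{2} l_1 + \tfrac{3}{2} l_2\right) - \tfrac{3}{2} l_2 \le \tfrac{1}{2} L.$$

Case 3: If $3 l_2 \le l_1$ then by Corollary 4.5 we have $\Delta O \ge \frac{1}{4}(l_1 - l_2)$ and together with
$M \le o_{21} \le l_2 + \frac{1}{2} l_1$ we get

$$2M - 7\Delta O \le 2 l_2 + l_1 - \tfrac{7}{4} l_1 + \tfrac{7}{4} l_2 = -\tfrac{3}{4} l_1 + \tfrac{15}{4} l_2 = -\tfrac{3}{4} l_1 + \tfrac{5}{4}(3 l_2) \le \tfrac{1}{2} L.$$

□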

6 Tight examples 

6.1 Tightness of Theorem 3.2

We will now show that Theorem 3.2 is essentially tight. To this end, we give two examples of
cycles in the overlap graph, for which $2M + 7O = 11L - O(1)$. Note that by increasing the
lengths of the strings in these cycles we can get $\frac{2M + 7O}{11L} \to 1$.

Example 6.1. Let $w_1 = ba^kb\,|\,a^{k+1}ba^{k+1}$ and $w_2 = a^{k+1}\,|\,ba^kb$ (we use the symbol $|$ to mark
the border between $p_{\max}$ and $p_{\min}$). Here $l_1 = 3k + 5$, $l_2 = 2k + 3$, so $L = 5k + 8$.

Now, let $x_i = (w_i^\infty)[1, 2l_i - 1]$ for $i = 1, 2$. Note that all $w_i$ are nice words and every $x_i$ is
a $w_i$-word.

We have $o_{12} = 4k + 5$ and $o_{21} = 3k + 4$, so $O = 7k + 9$ and $M = 3k + 4$. Note that
$2M + 7O = 6k + 8 + 49k + 63 = 55k + 71 = 11L - O(1)$.

Example 6.2. Let $w_1 = ba^nba^{n+1}ba^nb\,|\,a^{n+1}ba^{n+1}ba^{n+1}$, $w_2 = a^{n+1}ba^{n+1}\,|\,ba^nba^{n+1}ba^nb$,
$w_3 = a^{n+1}\,|\,ba^nb$. We have $l_1 = 6n + 10$, $l_2 = 5n + 8$, $l_3 = 2n + 3$, so $L = 13n + 21$.

Now, let $x_1 = (w_1^\infty)[1, 2l_1 - 1]$, $x_2 = (w_2^\infty)[1, 2l_2 + \alpha_2 - 1]$ and $x_3 = (w_3^\infty)[1, 4l_3 - 1]$. Note
that all $w_i$ are nice words and every $x_i$ is a $w_i$-word.

We have $o_{12} = 8n + 12$, $o_{23} = 6n + 8$ and $o_{31} = 5n + 7$, so $O = 19n + 27$ and $M = 5n + 7$.
Note that $2M + 7O = 10n + 14 + 133n + 189 = 11L - O(1)$.
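The arithmetic in Examples 6.1 and 6.2 can be checked with a short script. This sketch builds the period words $w_i$ to confirm the stated lengths, but it takes the overlap values $o_{ij}$ as stated in the examples and does not recompute them from the strings $x_i$. The function names are hypothetical.

```python
def example_6_1(k):
    """Return (L, O, M) for Example 6.1 with parameter k."""
    w1 = "b" + "a" * k + "b" + "a" * (k + 1) + "b" + "a" * (k + 1)
    w2 = "a" * (k + 1) + "b" + "a" * k + "b"
    L = len(w1) + len(w2)
    overlaps = [4 * k + 5, 3 * k + 4]          # o_12, o_21 as stated (not recomputed)
    return L, sum(overlaps), min(overlaps)

def example_6_2(n):
    """Return (L, O, M) for Example 6.2 with parameter n."""
    a, A = "a" * n, "a" * (n + 1)
    w1 = "b" + a + "b" + A + "b" + a + "b" + A + "b" + A + "b" + A
    w2 = A + "b" + A + "b" + a + "b" + A + "b" + a + "b"
    w3 = A + "b" + a + "b"
    L = len(w1) + len(w2) + len(w3)
    overlaps = [8 * n + 12, 6 * n + 8, 5 * n + 7]   # o_12, o_23, o_31 as stated
    return L, sum(overlaps), min(overlaps)

def gap(L, O, M):
    """The slack 11L - (2M + 7O) in Theorem 3.2."""
    return 11 * L - (2 * M + 7 * O)
```

With these values the slack is the constant 17 for Example 6.1 and 28 for Example 6.2, independent of $k$ and $n$, which is the $O(1)$ term in the text.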

We will now show that the bound we give on $\min(M, (1 - c)O)$ in Corollary 3.4 is also
essentially tight.
Consider a cycle cover $C$ in the overlap graph, composed of two collections of cycles: $C_2$
consisting of 2-cycles of the form described in Example 6.1 and $C_3$ consisting of 3-cycles
described in Example 6.2 (note that these cycles need to use different $n$ so that their vertices
are non-equivalent).

Let $L_2, L_3$ be the total length of the periods of the strings in the cycles of $C_2$ and $C_3$,
respectively. Let $O_2$ be the sum of all overlaps on the cycles in $C_2$ and let $M_2$ be the sum
of smallest overlaps for each cycle in $C_2$. Similarly define $O_3$ and $M_3$ for $C_3$. Finally let
$L = L_2 + L_3$, $O = O_2 + O_3$ and $M = M_2 + M_3$.

Note that to make the analysis in Corollary 3.4 tight we only need to make $M = (1 - c)O$,
since we already have $2M + 7O = 11L - O(1)$. Since $M_2 \approx \frac{3}{7} O_2$ and $M_3 \approx \frac{5}{19} O_3$, this can
be done by adjusting the balance between $L_2$ and $L_3$, provided that $c \in [\frac{4}{7}, \frac{14}{19}]$. The current
best approximation ratio of $\frac{2}{3}$ for Max-ATSP-Path sits well within this interval.
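The balancing step can be made concrete. The following sketch assumes the limiting ratios $M_2 / O_2 = \frac{3}{7}$ and $M_3 / O_3 = \frac{5}{19}$ obtained from Examples 6.1 and 6.2 for large parameters, and solves for the share of the total overlap that the 2-cycles should contribute so that $M = (1 - c) O$. The function name is hypothetical.

```python
from fractions import Fraction as F

def two_cycle_share(c, r2=F(3, 7), r3=F(5, 19)):
    """Fraction t = O_2 / O of the total overlap to place in 2-cycles so that
    M = (1 - c) * O, given M_2 = r2 * O_2 and M_3 = r3 * O_3.

    Solves r2 * t + r3 * (1 - t) = 1 - c for t."""
    target = 1 - c
    if not (r3 <= target <= r2):
        raise ValueError("1 - c must lie between r3 and r2")
    return (target - r3) / (r2 - r3)
```

For $c = \frac{2}{3}$ this gives $t = \frac{14}{33}$, i.e. $O_2 : O_3 = 14 : 19$, and the endpoints $c = \frac{4}{7}$ and $c = \frac{14}{19}$ correspond to using only 2-cycles and only 3-cycles respectively.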

6.2 The Greedy Algorithm 

Recall the greedy algorithm, which picks two strings with the largest overlap and combines
them together until a single string remains. The bounds in Breslauer et al. can be used to
improve the analysis of this algorithm, as shown by Kaplan et al. [11]. It is natural to ask
whether our bounds can be used in a similar fashion. Unfortunately, it seems that there is
no simple way to do this. In their analysis, Kaplan et al. require a good bound on the
total overlap of a (possibly) long path of strings in the overlap graph. As it turns out, in this
case the overlap can actually approach the bound of $\frac{3}{2} \sum_i l_i$ arbitrarily closely, as can be seen
in the following example.

Example 6.3. For any $k \ge 1$ let $w_{2k} = b^k\,|\,a^k$ and $w_{2k-1} = a^{k-1}\,|\,b^k$. Also, let $x_{2k} =
b^k a^k b^k a^{k-1}$ and $x_{2k-1} = a^{k-1} b^k a^{k-1} b^{k-1}$. Note that all $w_i$ are nice words and every $x_i$ is a
$w_i$-word.

Consider $S = \{x_3, x_4, \ldots, x_n\} = \{ab^2ab, b^2a^2b^2a, a^2b^3a^2b^2, b^3a^3b^3a^2, \ldots\}$ and the path
$x_n \to x_{n-1} \to \ldots \to x_3$ in the overlap graph of $S$. It is easy to verify that $o_{i+1,i} = \lfloor \frac{3i}{2} \rfloor$.
Therefore, the total overlap of the path is approximately $\frac{3}{4} n^2$, and $\sum_{i=3}^{n} l_i \approx \frac{1}{2} n^2$.
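The overlaps claimed in Example 6.3 can be verified by brute force. The sketch below builds the strings $x_i$ as defined in the example and computes overlaps with a simple quadratic-time routine; the helper names `overlap` and `x` are hypothetical, and the overlap is taken to be the longest proper suffix of one string that is a proper prefix of the other.

```python
def overlap(s, t):
    """Length of the longest proper suffix of s that is also a proper prefix of t."""
    for m in range(min(len(s), len(t)) - 1, 0, -1):
        if s.endswith(t[:m]):
            return m
    return 0

def x(i):
    """The string x_i from Example 6.3 (i >= 1)."""
    k = (i + 1) // 2
    if i % 2 == 0:                       # i = 2k
        return "b" * k + "a" * k + "b" * k + "a" * (k - 1)
    # i = 2k - 1
    return "a" * (k - 1) + "b" * k + "a" * (k - 1) + "b" * (k - 1)

def path_overlaps(n):
    """Overlaps o_{i+1,i} along the path x_n -> x_{n-1} -> ... -> x_3."""
    return {i: overlap(x(i + 1), x(i)) for i in range(3, n)}
```

Since $l_i = |w_i| = i$ here, comparing `sum(path_overlaps(n).values())` with `sum(range(3, n + 1))` for growing `n` shows the ratio approaching $\frac{3}{2}$.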

This, of course, does not rule out using our results to improve the analysis of the greedy 
algorithm. However, any such result requires some additional insight. 
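For reference, the greedy algorithm recalled at the start of this subsection can be sketched as follows. This is a straightforward quadratic-search illustration rather than an efficient or canonical implementation: ties are broken by first occurrence, the input is assumed to be a non-empty list of strings, and the helper names are hypothetical.

```python
def overlap(s, t):
    """Length of the longest proper suffix of s that is also a proper prefix of t."""
    for m in range(min(len(s), len(t)) - 1, 0, -1):
        if s.endswith(t[:m]):
            return m
    return 0

def greedy_superstring(strings):
    """Repeatedly merge the ordered pair of strings with the largest overlap."""
    pool = list(dict.fromkeys(strings))          # drop exact duplicates, keep order
    while len(pool) > 1:
        best = (-1, 0, 1)                        # (overlap, index of left, index of right)
        for i, s in enumerate(pool):
            for j, t in enumerate(pool):
                if i != j:
                    m = overlap(s, t)
                    if m > best[0]:
                        best = (m, i, j)
        m, i, j = best
        merged = pool[i] + pool[j][m:]           # glue the pair along its overlap
        pool = [p for idx, p in enumerate(pool) if idx not in (i, j)] + [merged]
    return pool[0]
```

For example, `greedy_superstring(["abc", "bcd", "cde"])` first merges `abc` and `bcd` into `abcd` and then appends `e`, giving `abcde`.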